[SciPy-dev] Common data sets for testing purposes

Robert Kern robert.kern at gmail.com
Mon Jul 10 16:53:17 CDT 2006


David Huard wrote:
> Hi,
> 
> I think it would be useful if there were standard, common data sets 
> included in the scipy distribution (as in matlab or R). They could be 
> used to ease testing, the creation of demos or simply to give examples. 
> Also, if the data sets are chosen wisely, they could serve to attract 
> people from targeted disciplines to scipy (IQ scores data won't attract 
> the same crowd as neutrino counts or distributed sea surface temperatures).

Good idea. The first step would be collecting some datasets and writing one 
scipy/matplotlib (dare I say Chaco?) example per dataset. As we write the 
examples, the idioms we use to access the data should come to the surface, and 
we can possibly settle on a common data format and some utilities in scipy to 
make the demos accessible through a uniform interface (more or less; at the very 
least the file structure should settle out quickly: a README, example01.py, 
example02.py, plot01.png, data/*.dat, etc.).
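
To make the "uniform interface" idea concrete, here is a rough sketch of what a
loader for that directory layout might look like. Nothing like this exists in
scipy; the names load_dataset and read_table are hypothetical, and I am assuming
the .dat files are plain whitespace-delimited numeric text with optional
#-comments. It sticks to the standard library so it runs anywhere.

```python
from pathlib import Path


def read_table(path):
    """Parse a whitespace-delimited text file into a list of float rows.

    Assumed format (not an agreed standard): one record per line,
    '#' starts a comment, blank lines are ignored.
    """
    rows = []
    for line in Path(path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if line:
            rows.append([float(field) for field in line.split()])
    return rows


def load_dataset(root):
    """Return the README text and every data/*.dat table found under root.

    Hypothetical helper: the keys of the "data" dict are the .dat file
    names without their extension.
    """
    root = Path(root)
    readme = root / "README"
    data_files = sorted((root / "data").glob("*.dat"))
    return {
        "readme": readme.read_text() if readme.exists() else "",
        "data": {p.stem: read_table(p) for p in data_files},
    }
```

An example script sitting next to the README could then call
load_dataset(Path(__file__).parent) and pass the resulting rows to scipy or
matplotlib. Whatever format we actually settle on, only read_table would need
to change.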

I would prefer to keep the datasets out of the trunk and the distribution 
tarballs, though. The current download burden is somewhat heavy as it is, and 
some of the worthwhile datasets will probably be substantial in size. A few 
might be absorbed into the scipy trunk for use in unit tests or the (very 
lonely) tutorial. I suggest making a data/ directory in the repository as a 
sibling to branches/, tags/, and trunk/. I'll try to get around to it if no one 
beats me to it.

If you would like to start a Wiki page on www.scipy.org to collect pointers to 
useful datasets and example code, that would be great.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco

