[SciPy-dev] Dataset for examples and license

David Cournapeau david@ar.media.kyoto-u.ac...
Tue Apr 24 21:06:28 CDT 2007


Robert Kern wrote:
>
> I'm fiddling around with a convention for data packages. Let's suppose we have a
> namespace package scipydata. Each data package would be a subpackage under
> scipydata. It would provide some conventionally-named metadata to describe the
> dataset (`__doc__` to describe the dataset in prose, `source`, `copyright`,
> etc.) and a load() callable that would load the dataset and return a dictionary
> with its data. The load() callable could do whatever it needs to load the data.
> It might just return objects that are defined in code (e.g. numpy.array([...]))
> if they are small enough. Or it might read a CSV, NetCDF4, or HDF5 file that is
> included in the package. Or it might download something from a website or FTP site.
>
> The scipydata.util package would provide some utilities to help writing
> scipydata packages. Particularly, it would provide utilities to read some
> kind of configuration file or environment variable which establishes a cache
> directory such that large datasets can be downloaded from a website once and
> loaded from disk thereafter.
>
> The scipydata packages could then be distributed extremely easily as eggs, and
> getting your dataset would be as simple as
>
>   $ easy_install scipydata.cournapeaus_data
>
> Does that sound good to you?
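
To make the convention quoted above concrete, here is a minimal sketch of what one such data module might look like. It is only an illustration under assumptions: the dataset is made up, the metadata values are placeholders, the keys of the returned dictionary ("data", "names") are hypothetical, and an embedded string stands in for a CSV file that would ship inside the package. Plain lists are used to keep it dependency-free; a real package would more likely return numpy arrays.

```python
"""Hypothetical toy dataset: heights and weights of three fictional subjects."""
import csv
import io

# Conventionally-named metadata (placeholder values, for illustration only).
source = "made up for this example"
copyright = "public domain"

# Stand-in for a CSV file that would be included in the package.
_CSV_TEXT = """height_cm,weight_kg
170.0,65.0
182.5,80.2
158.0,52.3
"""


def load():
    """Load the dataset and return a dictionary with its data."""
    reader = csv.DictReader(io.StringIO(_CSV_TEXT))
    # One list per column, keyed by the column name from the header row.
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    # The key names "data" and "names" are assumptions, not an agreed layout.
    return {"data": columns, "names": list(columns)}
```

A caller would then only need load() and the metadata names, regardless of how the data is stored behind them.
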
I don't see any problem with that approach, and I am sure you know much
better than I do how to organize things for easy distribution. I think it
is important that everybody agrees on one file format (I have a preference
for HDF5, since it is well supported in Python through PyTables and has a
full C API). For really small datasets, CSV could be OK.

Would scipydata be in scipy? (I am asking again for license reasons :) ).

David


