[SciPy-user] A first proposal for dataset organization
Tue Sep 18 20:12:05 CDT 2007
Your proposal looks good and I think it's a great addition to SciPy. As for
the two issues you raise, here are my two cents.
I wouldn't bother too much about missing data. These data sets are mainly
for illustration and testing purposes. Hence, in general, we can choose data
sets that don't have missing data. Now, there should be a data set with
missing data to illustrate the use of masked arrays or statistical functions
robust to NaNs, but it can be kept pretty simple, that is, just a single
For large data sets, I'm not sure I understand what you mean. Do you
intend to include netcdf or HDF5 files and provide an interface to access
those data sets so users don't have to bother about the underlying engine?
Do we really want to distribute a package weighing more than 1 GB?
2007/9/17, David Cournapeau <firstname.lastname@example.org>:
> Hi there,
> A few months ago, we started to discuss various issues around
> datasets for numpy/scipy. In the context of my Summer of Code project on machine
> learning tools in python, I had the opportunity to tackle the issue
> concretely. Before announcing a first alpha version of my work, I would like
> to gather comments and criticism on the following proposal for dataset
> The following proposal is also available in svn:
> Dataset for scipy: design proposal
> One of the things numpy/scipy is missing now is a set of datasets, available for
> demos, courses, etc. For example, R has a set of datasets available at the
> The expected uses of the datasets are the following:
> - machine learning: e.g. the data also contain class information
> (discrete or continuous)
> - descriptive statistics
> - others ?
> That is, a dataset is not only data, but also some meta-data. The goal of this
> proposal is to propose common practices for organizing the data, in a way which
> is both straightforward, and does not prevent specific usage of the data.
> A preliminary set of datasets is available at the following address:
> Each dataset is a directory and defines a python package (i.e. has the
> __init__.py file). Each package is expected to define the function load,
> returning the corresponding data. For example, to access the dataset data1, you
> should be able to do:
> >>> from datasets.data1 import load
> >>> d = load() # -> d contains the data.
> load can do whatever it wants: fetching data from a file (python script,
> file, etc...), from the internet, etc... Some special variables must be defined
> for each package, each containing a python string:
> - COPYRIGHT: copyright information
> - SOURCE: where the data are coming from
> - DESCSHORT: short description
> - DESCLONG: long description
> - NOTE: some notes on the datasets.
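
The package layout described above can be sketched as a single module. This is
only an illustration: the package name data1 comes from the example above, the
contents of the strings and the toy values are invented, and the
short-description variable is spelled DESCSHORT here by analogy with DESCLONG.

```python
"""Hypothetical datasets/data1/__init__.py following the proposal."""
import numpy as np

# Metadata strings; the values are placeholders invented for this sketch.
COPYRIGHT = "Public domain (placeholder)."
SOURCE = "Synthetic values made up for this sketch."
DESCSHORT = "Toy two-attribute dataset."
DESCLONG = "Three samples with two float attributes, only to show the layout."
NOTE = "Not a real dataset."


def load():
    """Return the data; here it is built in memory rather than read from a file."""
    data = np.array(
        [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)],
        dtype=[("x", float), ("y", float)],
    )
    return {"data": data}


d = load()  # same call as in the usage example above
```

Keeping the metadata as plain module-level strings means a user can inspect
them without calling load at all.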
> Format of the data
> Here, I suggest a common practice for the value returned by the load function.
> Instead of using classes to provide meta-data, I propose to use a dictionary
> of arrays, with some values mandatory. The key goals are:
> - for people who just want the data, there is no extra burden (the "just
> give me the data!" motto).
> - for people who need more, they can easily extract what they need from
> the returned values. Higher-level abstractions can be built
> from this model.
> - all possible datasets should fit into this model.
> - In particular, I want to be able to convert our dataset to the
> Orange Dataset representation (or that of another machine learning tool), and
> For the datasets to be useful in the learn scikits, which is the project that
> initiated this datasets package, the data returned by load has to be a dictionary
> with the following conventions:
> - 'data': this value should be a record array containing the actual data.
> - 'label': this value should be a rank 1 array of integers containing the
> label index for each sample, that is label[i] should be the label
> of data[i]. If it contains float values, it is used for regression
> - 'class': a record array such that class[i] is the class name. In other
> words, this makes the correspondence label name -> label index.
> As an example, I use the famous IRIS dataset: the dataset contains 3 classes
> of flowers, and for each flower, 4 measurements (called attributes in machine
> learning vocabulary) are available (sepal width and length, petal width and
> length). In this case, the values returned by load would be:
> - 'data': a record array containing all the flowers' measurements. For
> descriptive statistics, that's all you may need. You can easily find
> the attributes from the dtype (a function to find the attributes is
> also available: it returns a list of the attributes).
> - 'label': an array of integers (for class information) or floats (for
> regression). Each class is encoded as an integer, and label[i]
> returns this integer for the sample i.
> - 'class': a record array, which returns the integer code for each
> class. For example, class['Iris-versicolor'] will return the integer
> used in label, and all samples i such that label[i] ==
> class['Iris-versicolor'] are of the class 'Iris-versicolor'.
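
As a rough sketch of the returned value described above, assuming plain numpy
structured arrays stand in for the record arrays: the four rows are a small
excerpt chosen for illustration, the field names are hypothetical, and the key
is spelled 'label' as in the conventions list.

```python
import numpy as np


def load():
    """Build a tiny Iris-like dictionary following the proposed conventions."""
    data = np.array(
        [
            (5.1, 3.5, 1.4, 0.2),
            (4.9, 3.0, 1.4, 0.2),
            (7.0, 3.2, 4.7, 1.4),
            (6.3, 3.3, 6.0, 2.5),
        ],
        dtype=[
            ("sepal length", float),
            ("sepal width", float),
            ("petal length", float),
            ("petal width", float),
        ],
    )
    # label[i] is the integer class code of data[i].
    label = np.array([0, 0, 1, 2])
    # One record whose fields map each class name to its integer code.
    cls = np.array(
        [(0, 1, 2)],
        dtype=[
            ("Iris-setosa", int),
            ("Iris-versicolor", int),
            ("Iris-virginica", int),
        ],
    )
    return {"data": data, "label": label, "class": cls}


d = load()
attributes = d["data"].dtype.names        # attribute names come from the dtype
code = d["class"]["Iris-versicolor"][0]   # integer code used in label
versicolor = d["data"][d["label"] == code]
```

Because 'class' is stored as a single record, indexing it by class name
returns a length-1 array, hence the trailing [0] to get the plain integer.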
> This contains enough information to get all useful information through
> introspection and simple functions. I already implemented a small module to do
> basic things such as:
> - selecting only a subset of all samples.
> - selecting only a subset of the attributes (only sepal length and
> width, for example).
> - selecting only the samples of a given class.
> - small summary of the dataset.
> This is implemented in less than 100 lines, which tends to show that the
> design is not too simplistic.
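
The four operations listed above can be sketched in a few lines each. This is
not the module mentioned in the proposal: the function names, the toy dataset
and its attribute and class names are all hypothetical, and the dictionary is
assumed to hold the 'data', 'label' and 'class' entries described earlier.

```python
import numpy as np


def make_toy():
    """Hypothetical four-sample dataset with three attributes and two classes."""
    data = np.array(
        [(1.0, 10.0, 100.0), (2.0, 20.0, 200.0),
         (3.0, 30.0, 300.0), (4.0, 40.0, 400.0)],
        dtype=[("a", float), ("b", float), ("c", float)],
    )
    label = np.array([0, 1, 0, 1])
    cls = np.array([(0, 1)], dtype=[("neg", int), ("pos", int)])
    return {"data": data, "label": label, "class": cls}


def select_samples(d, index):
    """Keep only the samples picked by index (slice, mask or index array)."""
    out = dict(d)
    out["data"] = d["data"][index]
    out["label"] = d["label"][index]
    return out


def select_attributes(d, names):
    """Keep only the named attributes (fields of the structured array)."""
    out = dict(d)
    out["data"] = d["data"][list(names)]
    return out


def select_class(d, name):
    """Keep only the samples whose label matches the given class name."""
    code = d["class"][name][0]
    return select_samples(d, d["label"] == code)


def summary(d):
    """Return a short text summary: sample count, attributes, class sizes."""
    names = ", ".join(d["data"].dtype.names)
    lines = ["%d samples, attributes: %s" % (len(d["data"]), names)]
    for name in d["class"].dtype.names:
        code = d["class"][name][0]
        lines.append("  %s: %d samples" % (name, np.sum(d["label"] == code)))
    return "\n".join(lines)


toy = make_toy()
print(summary(select_attributes(toy, ["a", "c"])))
```

Each helper returns a new dictionary with the same keys, so the selections can
be chained in any order.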
> Remaining problems:
> I see mainly two big problems:
> - if the dataset is big and cannot fit into memory, what kind of API do
> we want to avoid loading all the data in memory? Can we use
> memory-mapped arrays?
> - Missing data: I thought about subclassing both record arrays and
> masked arrays classes, but I don't know if this is feasible, or even
> makes sense. I have the feeling that some data mining software uses
> NaN (for example, weka seems to use float internally), but this
> prevents them from representing integer data.
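
To make the two problems concrete, here is a small sketch using only standard
numpy features. It does not answer the API question; it only shows how a
masked array keeps integer data where a NaN marker forces floats, and how
np.memmap reads part of a file without loading all of it. The variable names,
values and the temporary file are invented for the example.

```python
import os
import tempfile

import numpy as np

# Missing data, option 1: a masked array keeps the integer dtype.
# The third entry is missing; its stored value (0) is ignored.
ages = np.ma.masked_array([23, 35, 0, 41], mask=[False, False, True, False])
print(ages.dtype, ages.mean())   # mean over the three unmasked values

# Missing data, option 2: NaN as the marker forces a float array.
ages_nan = np.array([23, 35, np.nan, 41])
print(ages_nan.dtype, np.nanmean(ages_nan))

# Large data: write a file, then map it instead of reading it whole.
path = os.path.join(tempfile.mkdtemp(), "big.dat")
np.arange(1000, dtype=np.float64).tofile(path)
big = np.memmap(path, dtype=np.float64, mode="r")
chunk = big[10:20]               # only this slice needs to be read from disk
print(chunk.sum())
```

A load function could return such a memory-mapped array under 'data' so that
callers index it like any other array, though structured dtypes with masks
would need more thought.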
> Current implementation
> An implementation following the above design is available in
> scikits.learn.datasets. If you installed scikits.learn, you can execute the
> file learn/utils/attrselect.py, which shows the information you can easily
> extract for now from this model.
> Also, once the above problems are solved, an arff converter will be available:
> arff is the format used by WEKA, and many datasets are available at this
> Although the datasets package emerged from the learn package, I try to keep it
> independent from everything else, that is once we agree on the remaining
> problems and where the package should go, it can easily be put elsewhere
> without too much trouble.