[SciPy-user] A first proposal for dataset organization
Thu Sep 20 05:05:24 CDT 2007
Robert Kern wrote:
> David Huard wrote:
>> Hi Anne,
>> 2007/9/19, Anne Archibald <email@example.com>:
>> On 18/09/2007, David Huard <firstname.lastname@example.org> wrote:
>> > For large data sets, I'm not sure I understand what you mean. Do you
>> > intend to include netcdf or HDF5 files and provide an interface to
>> > those data sets so users don't have to bother about the underlying
>> > engine? Do we really want to distribute a package weighing > 1GB?
>> One of the points of this project, as I understand it, is to make it
>> convenient for people to get and use real datasets. In particular, one
>> possibility is to not include the data in this package, but instead
>> only a script to download it from (say) the HEASARC. Thus big datasets
>> are not outrageous, and more to the point, we need to be able to deal
>> with them whatever form they are in natively.
>> My understanding was rather:
>> "... to make it convenient for people to get and use real datasets for
>> use in SciPy and NumPy examples, documentation and tutorials." This
>> limits the scope of the dataset package, at least for starters. If some
>> tutorial deals with larger than memory issues, then using a specialized
>> binary format makes sense. However, I think that pretty basic datasets
>> can illustrate the use of most SciPy and NumPy functions.
> That's an important use case, certainly, but I had in mind use cases like the
> one Anne gave, too, when I suggested parts of the design that David implemented.
> The scope is still fairly broad.
Yes, indeed, my sentence "to make it convenient for people to get and
use real datasets for use in SciPy and NumPy examples, documentation and
tutorials" was just a list of possible usages, not the only usages to
take into account. I also realized that my proposal sounded as if I were
the only one involved, which was not the case. I hope the people involved
in previous discussions on that matter didn't take any offence.
David (Huard) already highlighted one problem with my proposal (time
series representation). I would really be interested in comments about
using MaskedArrays to handle missing data (I've never used them myself),
and about the use of record arrays for the data; for example, I can see
cases where record arrays may be a problem (even if all your data are
homogeneous, you cannot directly treat a record array as one big numpy
array), but I don't know if this is significant.
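
To make the two questions above concrete, here is a minimal sketch, not
part of the proposal itself. The field names, the values, and the -999.0
missing-value sentinel are all made up for illustration. It shows a masked
array skipping a missing value, and a record array with homogeneous fields
being viewed as a plain 2-D array:

```python
import numpy as np
import numpy.ma as ma

# Hypothetical measurements; -999.0 is an assumed "missing" sentinel.
raw = np.array([1.0, 2.0, -999.0, 4.0])
temps = ma.masked_values(raw, -999.0)  # mask every entry equal to -999.0
mean_temp = temps.mean()               # masked entries are ignored

# A record (structured) array whose fields all share the same dtype.
# Field names "height" and "weight" are illustrative only.
records = np.array(
    [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)],
    dtype=[("height", "f8"), ("weight", "f8")],
)

# The record array is 1-D (one element per record), so it cannot be used
# directly as a (rows, columns) array. Because every field is float64,
# it can be reinterpreted as a plain 2-D float array without copying.
plain = records.view("f8").reshape(len(records), -1)
col_means = plain.mean(axis=0)  # per-column means over all records
```

The view trick only works when every field has the same dtype; with mixed
field types, each column would have to be pulled out by name instead.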