[SciPy-user] Fast saving/loading of huge matrices

Francesc Altet faltet@carabos....
Fri Apr 20 02:05:40 CDT 2007


On Thu, 19 Apr 2007 at 17:26 -0500, Vincent Nijs wrote:
> Pytables looks very interesting and clearly has a ton of features. However,
> if I am trying to just read in a csv file, can it figure out the correct data
> types on its own (e.g., dates, floats, strings)? Read: "I am too lazy to
> type in variable names and types myself if the names are already in the
> file" :)

PyTables itself doesn't have a csv importer as such, but given the
existence of the csv module in the standard library, writing one
shouldn't be difficult at all (see the sketch below).
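For instance, a minimal sketch of such an importer could look like the
following (the file name 'quotes.csv' and the date/price/symbol layout
are only made-up examples; adapt the description class to your own
columns):

import csv
import tables

# Hypothetical record layout: one string, one float and another string.
class Quote(tables.IsDescription):
    date   = tables.StringCol(10)   # e.g. "2007-04-20"
    price  = tables.FloatCol()
    symbol = tables.StringCol(8)

f = tables.openFile('/tmp/quotes.h5', 'w')
table = f.createTable('/', 'quotes', Quote)
row = table.row
for date, price, symbol in csv.reader(open('quotes.csv')):
    row['date'] = date
    row['price'] = float(price)    # the csv module only hands out strings
    row['symbol'] = symbol
    row.append()
table.flush()
f.close()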

Regarding type discovery, no. PyTables is designed to cope with
extremely large amounts of data, and knowing exactly which type is
desired for each dataset is *crucial* for keeping the storage
requirements to a minimum.  However, if you don't mind how much space
your data will take on disk, you can always import the csv into a
NumPy array (or recarray) and save it into PyTables in a
straightforward way (see below).
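Something along these lines would do (again, the file and column names
are invented; numpy.rec.fromrecords is left to infer a type for each
column from the values it receives):

import csv
import numpy
import tables

# Read the whole csv into memory and let numpy pick the column types.
rows = [(d, float(p), s) for d, p, s in csv.reader(open('quotes.csv'))]
ra = numpy.rec.fromrecords(rows, names='date,price,symbol')

f = tables.openFile('/tmp/quotes.h5', 'w')
f.createTable('/', 'quotes', ra)   # the recarray's dtype defines the table
f.close()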

> Similarly, can you just dump a dictionary or rec-array into a pytable with
> one 'save' command and have pytables figure out the variable names and
> types? This seems relevant since you wouldn't have to do that with cPickle,
> which saves user time if not computer time.

You can easily save any numpy array (or recarray) in pytables:

>>> import numpy
>>> import tables
>>> f=tables.openFile('/tmp/tmp.h5','w')
>>> na=numpy.arange(10).reshape(2,5)
# saving an array
>>> tna=f.createArray('/', 'na', na)
# retrieving the array
>>> na_fromdisk = tna[:]
>>> na_fromdisk
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
>>> ra=numpy.empty(shape=2, dtype='i4,f8')  # uninitialized, hence the garbage values below
# saving a recarray
>>> tra=f.createTable('/', 'ra', ra)
# retrieving the recarray
>>> ra_fromdisk=tra[:]
>>> ra_fromdisk
array([(-1209301768, 1.0675549246111695e-269),
       (135695288, -1.3687048847096049e-40)],
      dtype=[('f0', '<i4'), ('f1', '<f8')])

So, all in all, it's not that difficult to save a single array or
recarray in pytables: just a single call (f.createArray() or
f.createTable()).

Moreover, the PyTables way of reading has a substantial advantage over
pickle: it supports general extended slicing (except for negative
values of the step), so you can do this sort of thing:

>>> f.root.na[1]
array([5, 6, 7, 8, 9])
>>> f.root.na[1,2:4]
array([7, 8])
>>> f.root.na[1,2::2]
array([7, 9])
>>> f.root.ra[1]
(135695288, -1.3687048847096049e-40)
>>> f.root.ra[::2]
array([(-1209301768, 1.0675549246111695e-269)],
      dtype=[('f0', '<i4'), ('f1', '<f8')])


which is of utmost importance when you have datasets that don't fit in
main memory but that you still want to work with. A sketch of that
kind of usage follows.
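For instance (just a sketch; the createEArray call below follows the
PyTables 2.x signature, which may differ slightly in other versions),
you can build an enlargeable array chunk by chunk and later read back
only the slices you need:

import numpy
import tables

f = tables.openFile('/tmp/big.h5', 'w')
# The 0 in the shape marks the enlargeable dimension.
big = f.createEArray('/', 'big', tables.Float64Atom(), shape=(0, 1000))
for i in range(1000):
    # roughly 800 MB in total, appended in 100-row chunks
    big.append(numpy.random.rand(100, 1000))
# Only these 10 rows are actually read back from disk:
chunk = f.root.big[50000:50010]
f.close()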

> Sorry if this is too off-topic.

Well, I don't think so, so don't be afraid to ask.

Cheers,

-- 
Francesc Altet    |  Be careful about using the following code --
Carabos Coop. V.  |  I've only proven that it works, 
www.carabos.com   |  I haven't tested it. -- Donald Knuth


