[SciPy-user] Fast saving/loading of huge matrices
Thu Apr 19 18:01:44 CDT 2007
I have a very similar question. PyTables clearly has much more
capability than I need and the documentation is a bit intimidating. I
have tests that involve multiple channels of data that I need to
store. Can you give a simple example of using PyTables to store 3
separate Nx1 vectors in the same file and easily retrieve the
individual channels? The cPickle equivalent would be to put the
vectors in a dictionary and then dump that dictionary to a pickle
file. How would I do the same thing with PyTables?
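A minimal sketch of that cPickle approach (the channel names and values here are made up for illustration):

```python
import pickle  # cPickle in Python 2; pickle in Python 3

# Three hypothetical Nx1 channels (plain lists here; NumPy arrays pickle the same way)
ch1 = [0.1, 0.2, 0.3]
ch2 = [1.0, 2.0, 3.0]
ch3 = [10.0, 20.0, 30.0]

mydict = {'ch1': ch1, 'ch2': ch2, 'ch3': ch3}

# Protocol 2 is the fast binary protocol discussed later in this thread
with open('channels.pkl', 'wb') as f:
    pickle.dump(mydict, f, protocol=2)

# Retrieving an individual channel means loading the whole dict back:
with open('channels.pkl', 'rb') as f:
    data = pickle.load(f)
print(data['ch2'])  # -> [1.0, 2.0, 3.0]
```

Note that, unlike an HDF5 file, the pickle must be loaded in full even if only one channel is wanted.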
On 4/19/07, Vincent Nijs <email@example.com> wrote:
> Pytables looks very interesting and clearly has a ton of features. However,
> if I am trying to just read in a csv file, can it figure out the correct data
> types on its own (e.g., dates, floats, strings)? Read: "I am too lazy to
> type in variable names and types myself if the names are already in the
> file" :)
> Similarly can you just dump a dictionary or rec-array into a pytable with
> one 'save' command and have pytables figure out the variable names and
> types? This seems relevant since you wouldn't have to do that with cPickle
> which saves user-time if not computer time.
> Sorry if this is too off-topic.
> On 4/19/07 2:30 PM, "Francesc Altet" <firstname.lastname@example.org> wrote:
> > On Thu, 19 Apr 2007 at 09:23 -0500, Robert Kern wrote:
> >> Gael Varoquaux wrote:
> >>> I have a huge matrix (I don't know how big it is, it hasn't finished
> >>> loading yet, but the ascii file weights 381M). I was wondering what
> >>> format had best speed efficiency for saving/loading huge file. I don't
> >>> mind using a hdf5 even if it is not included in scipy itself.
> >> I think we've found that a simple pickle using protocol 2 works the fastest. At
> >> the time (a year or so ago) this was faster than PyTables for loading the entire
> >> array of about 1 GB size. PyTables might be better now, possibly because of the
> >> new numpy support.
> > I was curious as well whether PyTables 2.0 is somewhat faster than the
> > 1.4 series (although I already knew that for this sort of thing, the
> > room for improvement should be rather small).
> > For that, I've made a small benchmark (see attachments) and compared the
> > performance for PyTables 1.4 and 2.0 against pickle (protocol 2). In the
> > benchmark, a NumPy array of around 1 GB is created and the time for
> > writing and reading it from disk is written to stdout. You can see the
> > outputs for the runs in the attachments as well.
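The shape of such a write/read benchmark, shrunk to a size that runs anywhere and limited to the pickle side (the original attached script also exercised PyTables, and used a ~1 GB array rather than the 10 MB stand-in below), might look like:

```python
import os
import pickle
import time

# A modest stand-in for the ~1 GB NumPy array of the original benchmark:
# 10 MB of repeating bytes (the real script used random float data).
data = bytes(range(256)) * (10 * 1024 * 1024 // 256)

fname = 'bench.pkl'

t0 = time.time()
with open(fname, 'wb') as f:
    pickle.dump(data, f, protocol=2)
write_time = time.time() - t0

t0 = time.time()
with open(fname, 'rb') as f:
    data2 = pickle.load(f)
read_time = time.time() - t0

assert data2 == data
print('write: %.3f s, read: %.3f s' % (write_time, read_time))
os.remove(fname)
```

As noted further down, timings like these measure the filesystem cache as much as the disk unless the file is explicitly flushed.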
> > From there, some conclusions can be drawn:
> > 1. The difference in performance between PyTables 1.4 and 2.0 for this
> > specific task is almost negligible. This was something expected because,
> > although 1.4 was using numarray at the core, the use of the array
> > protocol made copies of the arrays unnecessary (and hence the overhead
> > relative to 2.0, with NumPy at the core, is negligible).
> > 2. For writing, the EArray (Extensible Array) object of PyTables has
> > roughly the same speed as pickling the NumPy array (about 15% faster,
> > in fact, but that is not much). However, for reading, the speed-up of
> > PyTables over pickle is more than 2x (up to 2.35x for 2.0), which is
> > something to consider.
> > 3. For compressed EArrays, writing times are relatively bad: between
> > 0.06x (zlib and PyTables 1.4) and 0.15x (lzo and PyTables 2.0). However,
> > for reading, the ratios are quite good: between 0.57x (zlib and PyTables
> > 1.4) and 1.45x (lzo and PyTables 2.0). In general, one should expect
> > better performance from compressed data, but I chose completely random
> > data here, so the compressors weren't able to achieve even decent
> > compression ratios, and that hurts I/O performance quite a bit.
> > 4. The best performance is achieved by the simple Array object, which
> > can be neither enlarged nor compressed but is rather effective in terms
> > of I/O. For writing, it can be up to 1.74x faster than pickle (using
> > PyTables 2.0), and up to 3.56x faster for reading (using PyTables 1.4),
> > which is quite a lot (more than 500 MB/s) in terms of I/O speed.
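The point about random data in (3) can be checked with the zlib module directly: incompressible input stays essentially the same size, while structured input of the same length shrinks dramatically (the exact ratios will vary):

```python
import os
import zlib

random_data = os.urandom(1000000)    # incompressible, like the benchmark's random array
structured = b'0123456789' * 100000  # highly repetitive data of the same length

# Compressed size as a fraction of the original size
r_ratio = len(zlib.compress(random_data)) / len(random_data)
s_ratio = len(zlib.compress(structured)) / len(structured)

print('random:     %.3f' % r_ratio)  # close to 1.0 (slightly above, due to framing overhead)
print('structured: %.3f' % s_ratio)  # a tiny fraction of 1.0
```

With ratios near 1.0, compression only adds CPU cost to the I/O path, which is why the compressed-EArray writing numbers above look so poor on random data.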
> > I will warn the reader that these times are taken *without* taking into
> > account the flush time to disk for writing. When that time is included,
> > the gap between PyTables and pickle narrows significantly (but not when
> > using compression, where PyTables will continue to be rather slower in
> > comparison). So, you should take the above figures as *peak*
> > throughputs (which can be achieved when the dataset fits comfortably in
> > main memory because of the filesystem cache).
> > For reading, when the files don't fit in the filesystem cache or are
> > read for the first time, one should expect a significant degradation of
> > all the figures presented here. However, when using compression over
> > real data (where compression ratios of 2x or more are realistic), the
> > compressed EArray should be up to 2x faster for reading than other
> > solutions (I've noticed this many times in other contexts). This is so
> > because one has to read less data from disk and, moreover, today's CPUs
> > are exceedingly fast at decompressing.
> > The above benchmarks were run on a Linux machine running SuSE Linux
> > with an AMD Opteron @ 2 GHz, 8 GB of main memory and a 7200 rpm IDE
> > disk.
> > Cheers,