[Numpy-discussion] Pickle, pytables, and sqlite - loading and saving recarray's

Vincent Nijs v-nijs@kellogg.northwestern....
Fri Jul 20 09:10:34 CDT 2007


Thanks Francesc!

That does work much better:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version:  2.0
HDF5 version:      1.6.5
NumPy version:     1.0.4.dev3852
Zlib version:      1.2.3
BZIP2 version:     1.0.2 (30-Dec-2001)
Python version:    2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)]
Platform:          darwin-Power Macintosh
Byte-ordering:     big
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Test saving recarray using cPickle: 1.620880 sec/pass
Test saving recarray with pytables: 2.074591 sec/pass
Test saving recarray with pytables (with zlib): 14.320498 sec/pass


Test loading recarray using cPickle: 1.023015 sec/pass
Test loading recarray with pytables: 0.882411 sec/pass
Test loading recarray with pytables (with zlib): 3.692698 sec/pass


On 7/20/07 6:17 AM, "Francesc Altet" <faltet@carabos.com> wrote:

> On Friday 20 July 2007 04:42, Vincent Nijs wrote:
>> I am interested in using sqlite (or pytables) to store data for scientific
>> research. I wrote the attached test program to save and load a simulated
>> 11x500,000 recarray. Average save and load times are given below (timeit
>> with 20 repetitions). The save time for sqlite is not really fair because I
>> have to delete the data table each time before I create the new one. It is
>> still pretty slow in comparison. Loading the recarray from sqlite is
>> significantly slower than pytables or cPickle. I am hoping there may be
>> more efficient ways to save and load recarrays from/to sqlite than what I
>> am now doing. Note that I infer the variable names and types from the data
>> rather than specifying them manually.
>> 
>> I'd love to hear from people using sqlite, pytables, and cPickle about their
>> experiences.
>> 
>> saving recarray with cPickle:       1.448568 sec/pass
>> saving recarray with pytable:      3.437228 sec/pass
>> saving recarray with sqlite:         193.286204 sec/pass
>> 
>> loading recarray using cPickle:    0.471365 sec/pass
>> loading recarray with pytable:     0.692838 sec/pass
>> loading recarray with sqlite:        15.977018 sec/pass
> 
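
For the sqlite side, here is a minimal sketch of one common way to do the
save and load, assuming the rows are available as plain Python tuples (for a
recarray, something like rec.tolist() and rec.dtype.names would supply them).
The attached script is not visible here, so whether this matches or improves
on it is a guess. The helper names save_rows and load_rows, the SQL_TYPES
mapping, and the sample data are all made up for illustration. The column
names and types are inferred from the data, and the whole batch goes through
executemany inside a single transaction.

```python
import sqlite3

# Illustrative mapping from Python value types to SQLite column types.
# It is not exhaustive; extend it for other types as needed.
SQL_TYPES = {int: "INTEGER", float: "REAL", str: "TEXT"}


def save_rows(conn, table, names, rows):
    """Recreate `table` with types inferred from the first row, then bulk-insert."""
    columns = ", ".join(
        f'"{name}" {SQL_TYPES[type(value)]}' for name, value in zip(names, rows[0])
    )
    placeholders = ", ".join("?" for _ in names)
    with conn:  # one transaction for the whole batch, committed on success
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
        conn.execute(f'CREATE TABLE "{table}" ({columns})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)


def load_rows(conn, table):
    """Return (column names, list of row tuples) for `table`."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    names = [column[0] for column in cursor.description]
    return names, cursor.fetchall()


# Small made-up data set standing in for the simulated recarray.
conn = sqlite3.connect(":memory:")
names = ["id", "price", "label"]
rows = [(i, i * 0.5, f"item{i}") for i in range(1000)]

save_rows(conn, "data", names, rows)
loaded_names, loaded_rows = load_rows(conn, "data")
```

Dropping and recreating the table happens inside save_rows here, so the save
timing still includes that cost, as in the numbers quoted above.
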
> For a fairer comparison, and for large amounts of data, you should tell
> PyTables the expected number of rows (see [1]) that you will end up
> feeding into the table, so that it can choose the best chunk size for I/O
> purposes.
> 
> I've redone the benchmarks (the new script is attached) with
> this 'optimization' on and here are my numbers:
> 
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> PyTables version:  2.0
> HDF5 version:      1.6.5
> NumPy version:     1.0.3
> Zlib version:      1.2.3
> LZO version:       2.01 (Jun 27 2005)
> Python version:    2.5 (r25:51908, Nov  3 2006, 12:01:01)
> [GCC 4.0.2 20050901 (prerelease) (SUSE Linux)]
> Platform:          linux2-x86_64
> Byte-ordering:     little
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test saving recarray using cPickle: 0.197113 sec/pass
> Test saving recarray with pytables: 0.234442 sec/pass
> Test saving recarray with pytables (with zlib): 1.973649 sec/pass
> Test saving recarray with pytables (with lzo): 0.925558 sec/pass
> 
> Test loading recarray using cPickle: 0.151379 sec/pass
> Test loading recarray with pytables: 0.165399 sec/pass
> Test loading recarray with pytables (with zlib): 0.553251 sec/pass
> Test loading recarray with pytables (with lzo): 0.264417 sec/pass
> 
> As you can see, the differences between raw cPickle and PyTables are much
> smaller than when PyTables is not told the total number of rows.  In fact, an
> automatic optimization could easily be done in PyTables: when the user passes
> a recarray, its length would be compared with the default number of expected
> rows (currently 10000), and if the former is larger, the length of the
> recarray would be used instead.
> 
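
The rule described above is small enough to write down. This is only a sketch
of the idea as stated in the message, not PyTables code: the function name
choose_expected_rows and the constant DEFAULT_EXPECTED_ROWS are made up here,
and the value 10000 is the default quoted above.

```python
# Default quoted in the message above; treated here as an assumption.
DEFAULT_EXPECTED_ROWS = 10000


def choose_expected_rows(n_rows, default=DEFAULT_EXPECTED_ROWS):
    """Use the data length as the row hint whenever it exceeds the default."""
    return max(n_rows, default)


# A 500,000-row recarray overrides the default; a short one keeps it.
large_hint = choose_expected_rows(500000)
small_hint = choose_expected_rows(100)
```

The result would then be handed to the library as the expected-rows hint when
the table is created.
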
> I have also added the times when using compression, just in case you are
> interested in using it.  Here are the final file sizes:
> 
> $ ls -sh data
> total 132M
> 24M data-lzo.h5  43M data-None.h5  43M data.pickle  25M data-zlib.h5
> 
> Of course, this is using completely random data; with real data the
> compression ratios are expected to be better than this.
> 
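
The point about random versus real data can be illustrated with zlib from the
standard library alone. This is a rough sketch, not a rerun of the benchmark:
it compresses raw bytes rather than an HDF5 table, the function name
compression_ratio is made up, and the "structured" sample is an artificial
repeating pattern standing in for real, regular data.

```python
import os
import zlib


def compression_ratio(data):
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)


n_bytes = 1_000_000

# Uniformly random bytes: essentially incompressible.
random_bytes = os.urandom(n_bytes)

# Artificial, highly regular data: 8-character zero-padded numbers cycling 0..99.
structured_bytes = b"".join(
    str(i % 100).zfill(8).encode() for i in range(n_bytes // 8)
)

random_ratio = compression_ratio(random_bytes)
structured_ratio = compression_ratio(structured_bytes)
```

The random sample should come out at a ratio close to 1, while the regular
sample shrinks to a small fraction of its size. Real data usually falls
somewhere in between.
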
> [1] http://www.pytables.org/docs/manual/ch05.html#expectedRowsOptim
> 
> Cheers,

-- 
Vincent R. Nijs
Assistant Professor of Marketing
Kellogg School of Management, Northwestern University
2001 Sheridan Road, Evanston, IL 60208-2001
Phone: +1-847-491-4574 Fax: +1-847-491-2498
E-mail: v-nijs@kellogg.northwestern.edu
Skype: vincentnijs




