[Numpy-discussion] loading data
Fri Jun 26 06:05:58 CDT 2009
On Friday 26 June 2009 12:38:11 Mag Gam wrote:
> Thanks everyone for the great and well thought out responses!
> To make matters worse, this is actually a 50gb compressed csv file. So
> it looks like this, 2009.06.01.plasmasub.csv.gz
> We get this data from another lab from the Westcoast every night
> therefore I don't have the option to have this file natively in hdf5.
> We are sticking with hdf5 because we have other applications that use
> this data and we wanted to standardize hdf5.
Well, since you are adopting HDF5, the best solution would be for the Westcoast
lab to send the file directly in HDF5. That would save you a lot of headaches.
If this is not possible, then I think the best approach is to profile your
code and see where the bottleneck is. cProfile normally gives good insight
into what is consuming the most time in your code.
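As a rough sketch of what such a profiling run could look like (the convert() function here is only a hypothetical stand-in for your real conversion code, not something from your script):

```python
import cProfile
import io
import pstats


def convert():
    # Hypothetical stand-in for the real gzip -> np.loadtxt -> HDF5 conversion.
    return sum(i * i for i in range(100000))


profiler = cProfile.Profile()
profiler.enable()
convert()
profiler.disable()

# Sort by cumulative time and show the ten most expensive calls.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

In the real script, the lines at the top of this report should tell you whether gzip, np.loadtxt or the HDF5 writer dominates.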
There are three most probable hot spots: the decompressor (gzip), np.loadtxt()
and the HDF5 writer function. If the problem is gzip, then you won't be able
to accelerate the conversion unless the other lab is willing to use a lighter
compressor (lzop, for example). If it is np.loadtxt(), then you should ask
yourself whether you are trying to load everything in memory; if you are,
don't do that; just load and write slice by slice. Finally, if the problem is
in the HDF5 write, try to write array slices (and not record-by-record).
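A minimal sketch of the slice-by-slice idea, assuming the CSV has no header line and only numeric columns (I don't know your actual layout). The iter_chunks() helper, its parameters and the demo file are made up for illustration:

```python
import gzip
import itertools
import os
import tempfile

import numpy as np


def iter_chunks(path, rows_per_chunk=100000, delimiter=","):
    """Yield 2-D arrays of at most rows_per_chunk rows from a gzipped CSV."""
    with gzip.open(path, "rt") as fh:
        while True:
            # Pull only the next block of lines, never the whole file.
            lines = list(itertools.islice(fh, rows_per_chunk))
            if not lines:
                break
            yield np.loadtxt(lines, delimiter=delimiter, ndmin=2)


# Tiny demonstration on a throwaway file (5 rows, 3 columns).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(demo_path, "wt") as fh:
    for i in range(5):
        fh.write("%d,%d,%d\n" % (i, i + 1, i + 2))

shapes = [block.shape for block in iter_chunks(demo_path, rows_per_chunk=2)]
print(shapes)
```

Each block returned by the generator can then be handed to the HDF5 writer as one array slice (with h5py, for instance, something along the lines of dset[start:stop] = block), so memory use stays bounded by the chunk size.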
> Also, I am curious about Neil's np.memmap. Do you have a some sample
> code for mapping a compressed csv file into memory? and loading the
> dataset into a dset (hdf5 structure)?
No, np.memmap is meant to map *uncompressed binary* files in memory, so you
can't follow this path.
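For comparison, this is roughly the kind of file np.memmap does work with: raw, uncompressed binary data whose dtype and shape you already know. The file name and array below are invented for the example:

```python
import os
import tempfile

import numpy as np

# Write a small array as raw bytes: no header, no compression.
data = np.arange(12, dtype=np.float64).reshape(4, 3)
path = os.path.join(tempfile.mkdtemp(), "plain.bin")
data.tofile(path)

# Map the file; dtype and shape must be supplied because the file has no metadata.
mapped = np.memmap(path, dtype=np.float64, mode="r", shape=(4, 3))
row = np.array(mapped[2])  # copy one row out of the mapping
print(row)
```

A gzipped CSV has neither a fixed record size nor raw binary values, which is why it cannot be mapped this way.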