[Numpy-discussion] loading data
Fri Jun 26 06:09:13 CDT 2009
I really like the slice by slice idea!
But, I don't know how to implement the code. Do you have any sample code?
I suspect it's the writing portion that's taking the longest. I did a
simple decompress test and it's fast.
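
A decompress-only timing test of the kind mentioned above could look roughly
like the sketch below. It is not the exact test that was run; the function
name time_decompress and the 1 MiB buffer size are made up for illustration.
It reads the gzip stream to the end, throws the data away, and reports how
many uncompressed bytes were read and how long that took.

```python
import gzip
import time


def time_decompress(path, bufsize=1 << 20):
    """Read a gzip file to the end, discarding the data.

    Returns (uncompressed_bytes, elapsed_seconds). Nothing is parsed or
    written, so the elapsed time reflects decompression and disk reads only.
    """
    total = 0
    start = time.perf_counter()
    with gzip.open(path, "rb") as f:
        while True:
            buf = f.read(bufsize)  # hypothetical buffer size, tune as needed
            if not buf:
                break
            total += len(buf)
    return total, time.perf_counter() - start
```

If this number is small compared with the full conversion time, gzip is
probably not the bottleneck and the parsing or writing steps deserve a look.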
On Fri, Jun 26, 2009 at 7:05 AM, Francesc Alted<firstname.lastname@example.org> wrote:
> On Friday 26 June 2009 12:38:11 Mag Gam wrote:
>> Thanks everyone for the great and well thought out responses!
>> To make matters worse, this is actually a 50gb compressed csv file. So
>> it looks like this, 2009.06.01.plasmasub.csv.gz
>> We get this data from another lab on the West Coast every night,
>> therefore I don't have the option to have this file natively in hdf5.
>> We are sticking with hdf5 because we have other applications that use
>> this data and we wanted to standardize on hdf5.
> Well, since you are adopting HDF5, the best solution would be for the West
> Coast lab to send the file directly in HDF5. That will save you a lot of
> headaches.
> If this is not possible, then I think the best approach would be to profile
> your code and see where the bottleneck is. Using cProfile
> normally offers a good insight on what's consuming more time in your
> The three most probable hot spots are the decompressor (gzip), np.loadtxt
> and the HDF5 writer function. If the problem is gzip, then you won't be able
> to accelerate the conversion unless the other lab is willing to use a lighter
> compressor (lzop, for example). If it is np.loadtxt(), then you should ask
> yourself whether you are trying to load everything in memory; if you are,
> don't do that; just load and write slice by slice. Finally, if the problem is
> the HDF5 write, try to write array slices (and not record-by-record writes).
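
The slice-by-slice suggestion above could be sketched as follows. This is a
minimal sketch under several assumptions, not a tested converter: it assumes
the CSV is purely numeric with no header row, that h5py is the HDF5 library in
use (the thread does not say which one is), and the names iter_line_blocks,
csv_gz_to_hdf5, rows_per_block and the dataset name "data" are invented here.
Each block of lines is parsed with np.loadtxt and appended to a resizable
dataset with a single array-slice write.

```python
import gzip
from itertools import islice


def iter_line_blocks(path, rows_per_block=100_000):
    """Yield lists of text lines from a gzip file, one block at a time."""
    with gzip.open(path, "rt") as f:
        while True:
            block = list(islice(f, rows_per_block))
            if not block:
                break
            yield block


def csv_gz_to_hdf5(csv_gz_path, h5_path, dset_name="data",
                   rows_per_block=100_000):
    """Convert a numeric, header-less CSV to HDF5 without loading it whole."""
    # Third-party imports are local so the reader above runs without them.
    import numpy as np
    import h5py  # assumption: h5py is the HDF5 binding being used

    with h5py.File(h5_path, "w") as h5:
        dset = None
        for block in iter_line_blocks(csv_gz_path, rows_per_block):
            # np.loadtxt accepts a list of lines; ndmin=2 keeps rows 2-D.
            arr = np.loadtxt(block, delimiter=",", ndmin=2)
            if dset is None:
                # Unlimited first axis so the dataset can grow per block.
                dset = h5.create_dataset(
                    dset_name, shape=(0, arr.shape[1]),
                    maxshape=(None, arr.shape[1]),
                    dtype=arr.dtype, chunks=True,
                )
            start = dset.shape[0]
            dset.resize(start + arr.shape[0], axis=0)
            dset[start:] = arr  # one slice write per block, not per record
```

A call such as csv_gz_to_hdf5("2009.06.01.plasmasub.csv.gz", "out.h5") (the
output name is hypothetical) keeps only rows_per_block rows in memory at any
time, so peak memory depends on the block size rather than on the file size.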
>> Also, I am curious about Neil's np.memmap. Do you have some sample
>> code for mapping a compressed csv file into memory, and for loading the
>> dataset into a dset (hdf5 structure)?
> No, np.memmap is meant to map *uncompressed binary* files in memory, so you
> can't follow this path.
> Francesc Alted