[SciPy-User] IO of large ASCII table data

Keith Goodman kwgoodman@gmail....
Tue Aug 17 13:03:13 CDT 2010


On Tue, Aug 17, 2010 at 10:53 AM, Éric Depagne <edepagne@lcogt.net> wrote:
> On Tuesday 17 August 2010 at 10:41:26, Dan Lussier wrote:
>> I am looking to read in large (many million rows) ASCII space
>> separated tables into numpy arrays.
>>
>> In the past I have heard of people using Miller's TableIO to do this
>> but was wondering if a similarly fast method has been more recently
>> integrated into scipy/numpy?
>>
>> In consulting the documentation the most likely candidate is
>> numpy.genfromtxt(...).  Is this function pure Python or does it rely
>> on a C extension as was the case with Miller's TableIO?
>>
>> Any advice here would be great as my application could get seriously
>> bogged down (both time and memory) in reading these files into arrays
>> if I get onto the wrong track.
>>
>> Thanks.
> There is also numpy.loadtxt(), which can read data from a file.
> I use it to read large datasets. As for its speed, here are the numbers I
> typically get: extracting 2.5 million lines of 10 columns takes ~3 min.

For comparison, reading an array of the same size from an HDF5 file with
h5py (or pytables) is over 1500 times faster:

Save data:

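>> import numpy as np
>> import h5py
>> arr = np.random.rand(2500000, 10)
>> f = h5py.File('/tmp/speed.hdf5', 'w')  # open explicitly for writing
>> f['arr'] = arr  # stored as a dataset named 'arr'
>> f.close()  # flush to disk before reading it back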

Time the loading of data:

$ ipython
>> import time
>> import h5py
>> f = h5py.File('/tmp/speed.hdf5', 'r')  # open read-only
>> t1 = time.time(); a = f['arr'][:]; print(time.time() - t1)  # [:] reads the whole dataset into a numpy array
0.0953390598297

Speed up:

>> 3*60/0.0953390598297
   1887.9984795479013
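
The timing above only covers the binary read, so in practice the ASCII file still has to be parsed once and the result cached. Here is a minimal sketch of that workflow, under some assumptions: it uses numpy's own .npy format (np.save / np.load) as a stand-in for HDF5 so that it runs without h5py, the table is much smaller than 2.5 million rows, and the helper names cache_ascii_table and demo are made up for illustration. I have not timed it against the h5py numbers above.

```python
import os
import tempfile
import time

import numpy as np


def cache_ascii_table(txt_path, npy_path):
    """Parse a whitespace-separated ASCII table once and cache it as binary .npy."""
    arr = np.loadtxt(txt_path)  # slow step: text parsing
    np.save(npy_path, arr)      # later runs can reload this binary copy instead
    return arr


def demo(n_rows=20000, n_cols=10):
    """Time the text parse and the binary reload on a small synthetic table."""
    workdir = tempfile.mkdtemp()
    txt_path = os.path.join(workdir, "table.txt")
    npy_path = os.path.join(workdir, "table.npy")

    # Synthetic stand-in for the real data file (assumption: purely numeric columns).
    np.savetxt(txt_path, np.random.rand(n_rows, n_cols))

    t0 = time.perf_counter()
    parsed = cache_ascii_table(txt_path, npy_path)
    t_text = time.perf_counter() - t0

    t0 = time.perf_counter()
    reloaded = np.load(npy_path)
    t_binary = time.perf_counter() - t0

    return parsed, reloaded, t_text, t_binary


parsed, reloaded, t_text, t_binary = demo()
print("text parse: %.4f s, binary reload: %.4f s" % (t_text, t_binary))
```

The same pattern should carry over to h5py by writing the parsed array into an HDF5 file instead of calling np.save; the one-time parsing cost stays either way, only the repeated loads get cheaper.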

