[SciPy-User] IO of large ASCII table data

Benjamin Root ben.root@ou....
Tue Aug 17 13:07:59 CDT 2010


On Tue, Aug 17, 2010 at 1:03 PM, Keith Goodman <kwgoodman@gmail.com> wrote:

> On Tue, Aug 17, 2010 at 10:53 AM, Éric Depagne <edepagne@lcogt.net> wrote:
> > On Tuesday 17 August 2010 10:41:26, Dan Lussier wrote:
> >> I am looking to read in large (many million rows) ASCII space
> >> separated tables into numpy arrays.
> >>
> >> In the past I have heard of people using Miller's TableIO to do this
> >> but was wondering if a similarly fast method has been more recently
> >> integrated into scipy/numpy?
> >>
> >> In consulting the documentation the most likely candidate is
> >> numpy.genfromtxt(...).  Is this function pure Python or does it rely
> >> on a C extension as was the case with Miller's TableIO?
> >>
> >> Any advice here would be great as my application could get seriously
> >> bogged down (both time and memory) in reading these files into arrays
> >> if I get onto the wrong track.
> >>
> >> Thanks.
> > There is also the numpy.loadtxt() function, which can read data from a
> > file. I use it to read large datasets. As for its speed, here are the
> > numbers I typically get: extracting 2.5 million lines and 10 columns
> > takes ~3 min.
>
> For comparison, h5py (and pytables) are over 1500 times faster:
>
> Save data:
>
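> >> import numpy as np
> >> import h5py
> >> arr = np.random.rand(2500000, 10)
> >> f = h5py.File('/tmp/speed.hdf5', 'w')  # 'w' creates or truncates the file
> >> f['arr'] = arr
> >> f.close()  # flush the dataset to the file before timing the read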
>
> Time the loading of data:
>
> $ ipython
> >> import time
> >> import h5py
> >> f = h5py.File('/tmp/speed.hdf5', 'r')
> >> t1 = time.time(); a = f['arr'][:]; print(time.time() - t1)
> 0.0953390598297
>
> Speed up:
>
> >> 3*60/0.0953390598297
>   1887.9984795479013
>

Keith,

Note that files saved to the /tmp directory are likely on tmpfs, which
keeps its contents in RAM rather than on disk.  Your speed-up might not
reflect the cost of disk I/O.
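
One way to check this before trusting a timing is to look up which
filesystem the benchmark file sits on.  The following is only a rough
sketch for Linux, assuming /proc/mounts is available; the helper names
parse_mounts and fs_type_for are made up for illustration, and mount
points containing escaped spaces are not handled.

```python
import os


def parse_mounts(text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # Fields are: device, mount point, filesystem type, options, ...
            mounts.append((fields[1], fields[2]))
    return mounts


def fs_type_for(path, mounts):
    """Return the filesystem type of the longest mount point containing path."""
    path = os.path.abspath(path)
    best_point, best_type = "", None
    for point, fs_type in mounts:
        prefix = point.rstrip("/") + "/"
        if path == point or path.startswith(prefix):
            # ">=" lets a later entry for the same mount point win.
            if len(point) >= len(best_point):
                best_point, best_type = point, fs_type
    return best_type


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        # Prints the filesystem type for the benchmark file's location.
        print(fs_type_for("/tmp/speed.hdf5", parse_mounts(f.read())))
```

If this reports tmpfs, the read never touched a disk.  Even on a real
disk, a file that was just written is usually still in the OS page
cache, so a cold-cache read would be a fairer comparison.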

Ben Root