[Numpy-discussion] loading data

Neil Martinsen-Burrell nmb@wartburg....
Thu Jun 25 20:35:29 CDT 2009


On Thu, June 25, 2009 7:59 pm, Mag Gam wrote:
> I am very new to NumPy and Python. We are doing some research in our
> Physics lab and we need to store massive amounts of data (100GB
> daily). I am therefore going to use hdf5 and h5py. The problem is I
> am using np.loadtxt() to create my array and create a dataset
> according to that. np.loadtxt() is reading a file which is about 50GB.
> This takes a very long time! I was wondering if there was a much
> easier and better way of doing this.

50 GB is a *lot* of data to read from disk into memory (assuming you
really do have that much memory).  A magnetic hard drive reads at less
than roughly 150 MB/s, so just pulling the blocks off the disk takes
over five minutes, and np.loadtxt does text parsing on top of that.
You may be interested in PyTables (www.pytables.org) or np.memmap.
Since you have already settled on HDF5, PyTables is the natural choice:
it can process on-disk datasets as if they were NumPy arrays, which is
handy if you don't have all 50 GB of memory.
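
Roughly, the idea with PyTables is to parse the text file in chunks and
append each chunk to an enlargeable array on disk, so only one chunk is
ever held in memory.  A sketch along these lines (with a recent PyTables;
the file names, column count, and chunk size below are placeholders to
adjust for your data):

    import numpy as np
    import tables                    # PyTables
    from itertools import islice

    N_COLS = 8                       # placeholder: columns in your text file
    CHUNK_ROWS = 1000000             # rows parsed per pass; tune for your RAM

    h5 = tables.open_file("data.h5", mode="w")
    arr = h5.create_earray(h5.root, "readings",
                           atom=tables.Float64Atom(),
                           shape=(0, N_COLS),       # enlargeable first axis
                           expectedrows=500000000)  # rough guess helps chunking

    with open("huge_input.txt") as f:               # placeholder input file
        while True:
            lines = list(islice(f, CHUNK_ROWS))     # next slab of text lines
            if not lines:
                break
            chunk = np.loadtxt(lines)               # parse only this slab
            arr.append(chunk.reshape(-1, N_COLS))   # write it to disk

    h5.close()

h5py can do the same with a chunked, resizable dataset (maxshape=(None,
N_COLS) plus dataset.resize()); either way the point is to stream the
parsing instead of calling np.loadtxt once on the whole 50 GB file.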

-Neil



