[Numpy-discussion] How to read data from text files fast?

Chris Barker Chris.Barker at noaa.gov
Thu Jul 8 16:21:16 CDT 2004


Chris Barker wrote:

>> can't
>> you just preallocate the array and read your data directly into it?
> 
> The short answer is that I'm not very smart! The longer answer is that 
> this is because at first I misunderstood what PyArray_FromDimsAndData 
> was for. For ScanFileN, I'll re-do it as you suggest.

I've re-done it. Now I don't double allocate storage for ScanFileN. 
There was no noticeable difference in performance, but why use memory 
you don't have to?
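The actual implementation is a C extension (FileScan_module.c, attached); here is a minimal Python/NumPy sketch of the ScanFileN idea, just to show the strategy: when the element count is known up front, allocate the array once and parse straight into it, with no temporary storage. The function name is illustrative, and modern NumPy stands in for the Numeric API used in the C code.

```python
import numpy as np

def scan_file_n(f, n):
    """Illustrative sketch: read n numbers (one per line) from an open
    file-like object directly into a preallocated array."""
    a = np.empty(n, dtype=np.float64)  # single allocation, no extra copy
    for i in range(n):
        a[i] = float(f.readline())     # parse each value into place
    return a
```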

For ScanFile, the size of the final array isn't known up front, so I now 
have two versions. The first is what I had before: it allocates memory 
in blocks of some BufferSize as it reads the file (currently 1024 
elements). Once everything is read in, it creates a PyArray of the 
appropriate size and copies the data into it. This means all the data 
exists in two copies until the temporary memory is freed.
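In Python/NumPy terms, that block-growing strategy looks roughly like the sketch below (again, the real code is C and uses Numeric; the names here are mine). Note that the parsed data lives twice, in the block list and in the final array, until the blocks are freed.

```python
import numpy as np

BUFFER_SIZE = 1024  # elements per block, as in the post

def scan_file(f):
    """Illustrative sketch: grow storage in fixed-size blocks while
    reading, then copy everything into one final array."""
    blocks, block, used = [], np.empty(BUFFER_SIZE), 0
    for line in f:
        if used == BUFFER_SIZE:  # current block full: start a new one
            blocks.append(block)
            block, used = np.empty(BUFFER_SIZE), 0
        block[used] = float(line)
        used += 1
    blocks.append(block[:used])       # keep only the filled part
    return np.concatenate(blocks)     # final copy into one array
```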

I now also have a ScanFile2, which scans the whole file first to count 
the elements, then creates a PyArray and re-reads the file to fill it. 
This version takes about twice as long, confirming my expectation that 
the time to allocate and copy data is tiny compared to the time spent 
reading and parsing the file.
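The two-pass approach, sketched the same way (hypothetical Python/NumPy stand-in for the C extension): count first, allocate exactly once, then rewind and parse into place. It avoids the double copy at the cost of reading the file twice.

```python
import numpy as np

def scan_file2(f):
    """Illustrative sketch of the two-pass strategy: pass 1 counts the
    elements, pass 2 parses them into an exactly-sized array."""
    n = sum(1 for _ in f)          # pass 1: count lines (one value each)
    f.seek(0)                      # rewind for the second pass
    a = np.empty(n, dtype=np.float64)
    for i, line in enumerate(f):   # pass 2: parse into place
        a[i] = float(line)
    return a
```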

Here's a simple benchmark:

Reading with Standard Python methods
(62936, 2)
it took 2.824013 seconds to read the file with standard Python methods
Reading with FileScan
(62936, 2)
it took 0.400936 seconds to read the file with FileScan
Reading with FileScan2
(62936, 2)
it took 0.752649 seconds to read the file with FileScan2
Reading with FileScanN
(62936, 2)
it took 0.441714 seconds to read the file with FileScanN

So it takes twice as long to count the numbers first, but it's still 
three times as fast as doing all this with standard Python. However, I 
usually don't think it's worth all this effort for a three-times 
speed-up, and I tend to make copies of my arrays all over the place with 
NumPy anyway, so I'm inclined to stick with the first method. Also, if 
you are really that tight on memory, you could always read the file in 
chunks with ScanFileN.

Any feedback anyone wants to give is very welcome.

-Chris


-- 
Christopher Barker, Ph.D.
Oceanographer
                                     		
NOAA/OR&R/HAZMAT         (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: FileScan_module.c
Url: http://projects.scipy.org/pipermail/numpy-discussion/attachments/20040708/a0eb5fcb/attachment-0001.c 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: setup.py
Url: http://projects.scipy.org/pipermail/numpy-discussion/attachments/20040708/a0eb5fcb/attachment-0002.pl 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: TestFileScan.py
Url: http://projects.scipy.org/pipermail/numpy-discussion/attachments/20040708/a0eb5fcb/attachment-0003.pl 
