[Numpy-discussion] How to read data from text files fast?
Chris Barker
Chris.Barker at noaa.gov
Thu Jul 8 16:21:16 CDT 2004
Chris Barker wrote:
>> can't you just preallocate the array and read your data directly
>> into it?
>
> The short answer is that I'm not very smart! The longer answer is that
> this is because at first I misunderstood what PyArray_FromDimsAndData
> was for. For ScanFileN, I'll re-do it as you suggest.
I've re-done it. Now I don't double allocate storage for ScanFileN.
There was no noticeable difference in performance, but why use memory
you don't have to?
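
To illustrate the preallocation idea, here is a minimal pure-Python sketch. It is not the C code from the attached module: the names tokens and scan_n are hypothetical, and the standard-library array type stands in for a PyArray. The storage for n numbers is allocated once and each value is parsed straight into it.

```python
import io
from array import array
from itertools import islice

def tokens(stream):
    """Yield whitespace-separated tokens from a text stream, line by line."""
    for line in stream:
        yield from line.split()

def scan_n(token_iter, n):
    """Read up to n numbers into storage that is allocated once, up front."""
    out = array("d", [0.0]) * n          # single allocation of n doubles
    count = 0
    for tok in islice(token_iter, n):    # consumes at most n tokens
        out[count] = float(tok)          # parse directly into the final storage
        count += 1
    # Trim (which copies) only if the input ran out before n values were read.
    return out if count == n else out[:count]

# Hypothetical usage: the same token iterator can be drained in chunks.
toks = tokens(io.StringIO("1 2 3\n4 5 6\n7 8\n"))
first = scan_n(toks, 6)
rest = scan_n(toks, 6)
print(list(first), list(rest))
```

Because the token iterator keeps its position between calls, calling scan_n repeatedly reads a file in fixed-size chunks without ever holding more than one chunk in memory.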
For ScanFile, the size of the final array is unknown at the outset, so
I now have two versions. The first is what I had before: it allocates
memory in blocks of some Buffersize (now set to 1024 elements) as it
reads the file. Once everything is read in, it creates a PyArray of the
appropriate size and copies the data into it. This means two copies of
all the data exist until the temporary memory is freed.
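
The block-wise approach can be sketched roughly as follows. This is a pure-Python illustration under my own assumptions, not the C implementation: scan_blocks is a hypothetical name, and array.array stands in for both the temporary buffers and the final PyArray.

```python
import io
from array import array

BUFFER_SIZE = 1024  # elements per temporary block, the value mentioned above

def scan_blocks(stream, buffer_size=BUFFER_SIZE):
    """Read all numbers from a stream whose length is not known in advance."""
    blocks = []                 # completed temporary blocks
    current = array("d")        # block currently being filled
    for line in stream:
        for tok in line.split():
            if len(current) == buffer_size:
                blocks.append(current)      # block is full, start a new one
                current = array("d")
            current.append(float(tok))
    blocks.append(current)

    # The total size is known only now: allocate the final array and copy in.
    total = sum(len(b) for b in blocks)
    result = array("d", [0.0]) * total
    pos = 0
    for b in blocks:
        result[pos:pos + len(b)] = b
        pos += len(b)
    return result               # the temporary blocks are freed after return

data = io.StringIO(" ".join(str(i) for i in range(2500)))
values = scan_blocks(data)
print(len(values))
```

Until the function returns, both the temporary blocks and the final array are alive, which is the double copy described above.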
I now also have a ScanFile2, which scans the whole file first, then
creates a PyArray, and re-reads the file to fill it up. This version
takes about twice as long, confirming my expectation that the time to
allocate and copy data is tiny compared to reading and parsing the file.
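
For comparison, a two-pass version might look something like this sketch. Again it is pure Python with a hypothetical name (scan_two_pass), and it assumes the stream is seekable so it can be rewound between passes. Here the first pass only counts tokens; the C version may well parse the numbers in both passes.

```python
import io
from array import array

def scan_two_pass(stream):
    """Count the numbers first, then allocate once and re-read to fill."""
    # Pass 1: count whitespace-separated tokens without storing them.
    count = sum(len(line.split()) for line in stream)

    # Allocate the final storage exactly once, at its final size.
    result = array("d", [0.0]) * count

    # Pass 2: rewind and parse each value directly into place.
    stream.seek(0)
    i = 0
    for line in stream:
        for tok in line.split():
            result[i] = float(tok)
            i += 1
    return result

values = scan_two_pass(io.StringIO("1.5 2.5\n3.5 4.5\n"))
print(list(values))
```

This never holds more than one copy of the data, at the cost of reading the file twice.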
Here's a simple benchmark:
Reading with Standard Python methods
(62936, 2)
it took 2.824013 seconds to read the file with standard Python methods
Reading with FileScan
(62936, 2)
it took 0.400936 seconds to read the file with FileScan
Reading with FileScan2
(62936, 2)
it took 0.752649 seconds to read the file with FileScan2
Reading with FileScanN
(62936, 2)
it took 0.441714 seconds to read the file with FileScanN
So it takes twice as long to count the numbers first, but it's still
three times as fast as just doing all this with Python. However, I
usually don't think it's worth all this effort for a 3 times speed up,
and I tend to make copies of my arrays all over the place with NumPy
anyway, so I'm inclined to stick with the first method. Also, if you are
really that tight on memory, you could always read it in chunks with
ScanFileN.
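
The ratios quoted above can be checked directly from the benchmark timings. The snippet below just does the arithmetic on the four numbers printed by the test run.

```python
# Timings in seconds, copied from the benchmark output above.
timings = {
    "standard Python": 2.824013,
    "FileScan": 0.400936,
    "FileScan2": 0.752649,
    "FileScanN": 0.441714,
}

baseline = timings["standard Python"]
# Speedup of each method relative to the standard Python approach.
speedups = {name: baseline / t for name, t in timings.items()}
# Cost of the two-pass version relative to the block-allocating one.
two_pass_penalty = timings["FileScan2"] / timings["FileScan"]

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
print(f"FileScan2 / FileScan: {two_pass_penalty:.2f}")
```

So the two-pass version takes a little under twice as long as the single-pass one, and remains between three and four times faster than the pure-Python reader.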
Any feedback anyone wants to give is very welcome.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
NOAA/OR&R/HAZMAT (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: FileScan_module.c
Url: http://projects.scipy.org/pipermail/numpy-discussion/attachments/20040708/a0eb5fcb/attachment-0001.c
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: setup.py
Url: http://projects.scipy.org/pipermail/numpy-discussion/attachments/20040708/a0eb5fcb/attachment-0002.pl
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: TestFileScan.py
Url: http://projects.scipy.org/pipermail/numpy-discussion/attachments/20040708/a0eb5fcb/attachment-0003.pl