[Numpy-discussion] How to read data from text files fast?

Chris Barker Chris.Barker at noaa.gov
Thu Jul 8 10:58:07 CDT 2004


Thanks to Fernando Perez and Travis Oliphant for pointing me to:

> scipy.io.read_array

In testing, I've found that it's very slow (for my needs), though quite 
nifty in other ways, so I'm sure I'll find a use for it in the future.

Travis Oliphant wrote:

 > Alternatively, we could move some of the Python code in read_array to 
 > C to improve the speed.

That was beyond me, so I wrote a very simple module in C that does what 
I want, and it is much faster than either read_array or a straight Python 
version. It has two functions:

FileScan(file)
"""
Reads all the values in the rest of the ASCII file, and produces a Numeric
vector full of Floats (C doubles).

All text in the file that is not part of a floating point number is
skipped over.
"""

FileScanN(file, N)

"""
Reads N values from the ASCII file, and produces a Numeric vector of
length N full of Floats (C doubles).

Raises an exception if there are fewer than N numbers in the file.

All text in the file that is not part of a floating point number is
skipped over.

After reading N numbers, the file is left before the next non-whitespace
character in the file. This will often leave the file at the start of
the next line, after scanning a line full of numbers.
"""
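
For anyone without the compiled module, the behaviour described in the two docstrings can be approximated in pure Python. This is only a sketch of the semantics, not the enclosed C code: the names file_scan and file_scan_n, the regular expression, the use of plain lists instead of Numeric arrays, the binary-mode file, and ValueError as the exception type are all assumptions made for illustration.

```python
import io
import re

# Matches decimal floats such as 12, -3.5, .25, 1e-3 or 6.02E+23.
# Assumption: the real C module may accept a different set of spellings.
_FLOAT_RE = re.compile(rb"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def file_scan(f):
    """Return every number in the rest of binary file f as a list of floats."""
    return [float(tok.decode("ascii")) for tok in _FLOAT_RE.findall(f.read())]


def file_scan_n(f, n):
    """Return the next n numbers from seekable binary file f.

    Raises ValueError if fewer than n numbers remain. Afterwards the file
    is positioned just before the next non-whitespace character.
    """
    start = f.tell()
    text = f.read()
    values = []
    pos = 0
    for match in _FLOAT_RE.finditer(text):
        if len(values) == n:
            break
        values.append(float(match.group().decode("ascii")))
        pos = match.end()
    if len(values) < n:
        raise ValueError("expected %d numbers, found %d" % (n, len(values)))
    # Skip trailing whitespace so the next read starts at real content.
    while pos < len(text) and text[pos:pos + 1].isspace():
        pos += 1
    f.seek(start + pos)
    return values


# Small in-memory demonstration (io.BytesIO stands in for a real file).
demo = io.BytesIO(b"1, 2.5\n-3e2, abc .5\n")
first_line = file_scan_n(demo, 2)   # the two numbers on the first line
remainder = file_scan(demo)         # everything after that
```

Reading the whole remainder into memory keeps the sketch short; the file is opened in binary mode so that the offset arithmetic in seek() is valid.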

I implemented them separately, 'cause I wasn't sure how to deal with 
optional arguments in a C function. They could easily be wrapped in a 
Python function if you wanted one interface.
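
A single interface of that kind might look like the sketch below. The names make_scanner and scan, and the choice of n=None to mean "read everything", are hypothetical; the two underlying functions are passed in so the example runs without the compiled module.

```python
def make_scanner(scan_all, scan_n):
    """Build one function from a 'read everything' and a 'read N' function."""
    def scan(file, n=None):
        # n omitted: read to the end of the file; n given: read exactly n values.
        if n is None:
            return scan_all(file)
        return scan_n(file, n)
    return scan


# Intended use, assuming the compiled module is installed
# ("data.txt" is a placeholder file name):
#     import FileScanner
#     scan = make_scanner(FileScanner.FileScan, FileScanner.FileScanN)
#     everything = scan(open("data.txt"))
#     first_ten = scan(open("data.txt"), 10)
```

Testing for None rather than for a false value keeps scan(file, 0) meaning "read zero values" instead of silently reading the whole file.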

FileScan was much more complex, as I had to deal with all the dynamic 
memory allocation. I probably took a more complex approach to this than 
I had to, but it was an exercise for me, being a newbie at C.

I also decided not to specify a shape for the resulting array, always 
returning a rank-1 array, as that made the code easier, and you can 
always set A.shape afterward. This could be put in a Python wrapper as well.
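
To show what setting the shape afterward amounts to, here is the same regrouping done on a plain Python list, so it runs without Numeric. The helper name as_rows is hypothetical; with a real Numeric array the one-line assignment A.shape = (-1, 2) does the equivalent job without copying.

```python
def as_rows(flat, ncols):
    """Split a flat sequence into rows of ncols items, like shape = (-1, ncols)."""
    if ncols <= 0 or len(flat) % ncols != 0:
        raise ValueError("%d values cannot form rows of %d" % (len(flat), ncols))
    return [list(flat[i:i + ncols]) for i in range(0, len(flat), ncols)]


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # stand-in for the rank-1 result
rows = as_rows(values, 2)                 # three rows of two values each
```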

It has the obvious limitation that it only does doubles. I'd like to add 
longs as well, but probably won't have a need for anything else. The way 
memory is these days, it seems just as easy to read the larger types 
(doubles or longs) and convert afterward if you want something smaller.

Here is a quick benchmark (see below) run with a file that is 63,000 
lines, with two comma-delimited numbers on each line. Run on a 1GHz P4 
under Linux.

Reading with read_array
it took 16.351712 seconds to read the file with read_array
Reading with Standard Python methods
it took 2.832078 seconds to read the file with standard Python methods
Reading with FileScan
it took 0.444431 seconds to read the file with FileScan
Reading with FileScanN
it took 0.407875 seconds to read the file with FileScanN

As you can see, read_array is painfully slow for this kind of thing, 
straight Python is OK, and FileScan is pretty darn fast.
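
To put numbers on that, the ratios can be worked out from the timings above: FileScan comes out roughly 37 times faster than read_array and a bit over 6 times faster than the plain Python loop. The snippet below just redoes that arithmetic; the dictionary and the speedup helper are illustrative names, and the figures are the ones from this single run.

```python
# Wall-clock seconds reported in the benchmark output above.
timings = {
    "read_array": 16.351712,
    "standard Python": 2.832078,
    "FileScan": 0.444431,
    "FileScanN": 0.407875,
}


def speedup(slow, fast):
    """How many times faster 'fast' was than 'slow' in this one run."""
    return timings[slow] / timings[fast]


for fast in ("FileScan", "FileScanN"):
    for slow in ("read_array", "standard Python"):
        print("%s vs %s: %.1fx" % (fast, slow, speedup(slow, fast)))
```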

I've enclosed the C code and setup.py. If anyone wants to take a look, 
use it, or offer suggestions or bug fixes, that would be great.

In particular, I don't think I've structured the code very well, and 
there could be a memory leak, which I have not tested carefully for.

Tested only on Linux with Python 2.3.3 and Numeric 23.1. If someone wants 
to port it to numarray, that would be great too.

-Chris


The benchmark:

def test6():
     """
     Testing various IO options
     """
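     import time
     from Numeric import array, product
     from scipy.io import array_import
     import FileScanner   # the C extension module enclosed with this message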

     filename = "JunkBig.txt"
     file = open(filename)
     print "Reading with read_array"
     start = time.time()
     A = array_import.read_array(file,",")
     print "it took %f seconds to read the file with read_array"%(time.time() - start)
     file.close()

     file = open(filename)
     print "Reading with Standard Python methods"
     start = time.time()
     A = []
     for line in file:
         A.append( map ( float, line.strip().split(",") ) )
     A = array(A)
     print "it took %f seconds to read the file with standard Python methods"%(time.time() - start)
     file.close()

     file = open(filename)
     print "Reading with FileScan"
     start = time.time()
     A = FileScanner.FileScan(file)
     A.shape = (-1,2)
     print "it took %f seconds to read the file with FileScan"%(time.time() - start)
     file.close()

     file = open(filename)
     print "Reading with FileScanN"
     start = time.time()
     A = FileScanner.FileScanN(file, product(A.shape) )
     A.shape = (-1,2)
     print "it took %f seconds to read the file with FileScanN"%(time.time() - start)
     file.close()

-- 
Christopher Barker, Ph.D.
Oceanographer
                                     		
NOAA/OR&R/HAZMAT         (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: FileScan_module.c
Url: http://projects.scipy.org/pipermail/numpy-discussion/attachments/20040708/ede864ba/attachment.c 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: setup.py
Url: http://projects.scipy.org/pipermail/numpy-discussion/attachments/20040708/ede864ba/attachment.pl 

