[Numpy-discussion] Possible roadmap addendum: building better text file readers

Chris Barker chris.barker@noaa....
Tue Mar 20 17:59:25 CDT 2012


Warren et al:

On Wed, Mar 7, 2012 at 7:49 AM, Warren Weckesser
<warren.weckesser@enthought.com> wrote:
> If you are setup with Cython to build extension modules,

I am

> and you don't mind
> testing an unreleased and experimental reader,

and I don't.

> you can try the text reader
> that I'm working on: https://github.com/WarrenWeckesser/textreader

It just took me a while to get around to it!

First of all: this is pretty much exactly what I've been looking for
for years, and never got around to writing myself - thanks!

My comments/suggestions:

1) a docstring for the textreader module would be nice.

2) "tzoffset" -- this is tricky stuff. Ideally, it should be able to
parse an ISO datetime string timezone specifier, but short of that, I
think the default should be None or UTC -- time zones are too ugly to
presume anything!
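
A minimal sketch of what parsing the timezone specifier could look like. This is not textreader code: the function name parse_iso_tz_offset and the regular expression are assumptions made up for illustration, and the pattern only covers the common "Z", "+hh", "+hhmm" and "+hh:mm" forms that follow a time of day. It returns None when no specifier is present, so nothing is presumed.

```python
import re
from datetime import timedelta

# Hypothetical helper, not part of textreader. A time of day must precede
# the specifier, so a bare date such as "2012-03-20" is not misread as "-20".
_TZ_RE = re.compile(
    r"[T ]\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?(Z|[+-]\d{2}(?::?\d{2})?)?$"
)


def parse_iso_tz_offset(stamp):
    """Return the UTC offset in an ISO 8601 string as a timedelta, or None."""
    match = _TZ_RE.search(stamp)
    if match is None or match.group(1) is None:
        return None  # no time zone given: do not guess
    spec = match.group(1)
    if spec == "Z":
        return timedelta(0)
    sign = -1 if spec[0] == "-" else 1
    digits = spec[1:].replace(":", "")
    hours = int(digits[:2])
    minutes = int(digits[2:4] or 0)
    return sign * timedelta(hours=hours, minutes=minutes)
```

With this, a string without a specifier yields None rather than a silently assumed offset, which is the default behaviour argued for above.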

3) it breaks with the old MacOS style line endings: \r only. Maybe no
need to support that, but it turns out one of my old test files still
had them!
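
For comparison, here is how pure Python copes with such a file. This is only a sketch and says nothing about textreader's tokenizer; split_rows and the comma delimiter are assumptions for illustration. It relies on bytes.splitlines(), which treats "\n", "\r\n" and a lone "\r" as line boundaries.

```python
def split_rows(raw, delimiter=b","):
    """Split raw bytes into rows of fields, accepting \\n, \\r\\n or \\r endings."""
    # bytes.splitlines() recognises all three line-ending conventions.
    return [line.split(delimiter) for line in raw.splitlines() if line.strip()]


old_mac = b"1,2\r3,4\r"       # old MacOS style: \r only
unix = b"1,2\n3,4\n"
assert split_rows(old_mac) == split_rows(unix)
```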

4) when I try to read more rows than are in the file, I get:
  File "textreader.pyx", line 247, in textreader.readrows (python/textreader.c:3469)
  ValueError: negative dimensions are not allowed

good to get an error, but it's not very informative!
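
The kind of message I have in mind could be produced by a check like the one below. It is a sketch in pure Python, not a patch to textreader: read_rows is a hypothetical stand-in that takes any iterator of lines, and the wording of the message is just a suggestion.

```python
from itertools import islice


def read_rows(lines, numrows):
    """Read exactly numrows whitespace-separated rows from an iterator of lines."""
    rows = [line.split() for line in islice(lines, numrows)]
    if len(rows) < numrows:
        # Report what was asked for and what was actually found.
        raise ValueError(
            "requested %d rows, but only %d were available"
            % (numrows, len(rows))
        )
    return rows
```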

5) for reading float64 values -- I get something different with
textreader than with Python's "float()":
  input:      "678.901"
  float()   : 678.90099999999995
  textreader: 678.90100000000007

The two agree to about 15 significant figures, which is all a float64
can be relied on for, but it's still curious...
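
One way to quantify the difference is to compare the IEEE 754 bit patterns of the two values quoted above. This is a small sketch using only the standard library; float_bits is a helper name made up here. For positive doubles, the difference between the bit patterns read as integers is the distance in representable values (ulps).

```python
import struct


def float_bits(x):
    """Return the IEEE 754 bit pattern of a float64 as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


a = float("678.90099999999995")   # value reported for float()
b = float("678.90100000000007")   # value reported for textreader
ulps_apart = float_bits(b) - float_bits(a)
print(ulps_apart)
```

A difference of 1 would mean the two results are neighbouring doubles, i.e. one of the two conversions is not rounding to the nearest representable value.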


6) Performance issue: in my case, I'm reading a big file that's in
chunks -- each one has a header indicating how many rows follow, then
the rows, so I parse it out bit by bit.
For smallish files, it's much faster than pure python, and almost as
fast as some old C code of mine that is far less flexible.

But for large files read in chunks, it's much slower -- indeed slower
than a pure python version for my use case.

I did a simplified test -- with 10,000 rows:

total number of rows:  10000
pure python took: 1.410408 seconds
pure python chunks took: 1.613094 seconds
textreader all at once took: 0.067098 seconds
textreader in chunks took : 0.131802 seconds

but with 1,000,000 rows:

total number of rows:  1000000
total number of chunks:  1000
pure python took: 30.712564 seconds
pure python chunks took: 31.313225 seconds
textreader all at once took: 1.314924 seconds
textreader in chunks took : 9.684819 seconds

Then it gets even worse with a smaller chunk size:

total number of rows:  1000000
total number of chunks:  10000
pure python took: 30.032246 seconds
pure python chunks took: 42.010589 seconds
textreader all at once took: 1.318613 seconds
textreader in chunks took : 87.743729 seconds

My code, which is C that essentially runs fscanf over the file, takes
essentially no performance hit from doing it in chunks -- so I think
something is wrong here.

Sorry, I haven't dug into the code yet to try to figure out what --
does it maybe rewind the file each time?
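
To make the access pattern concrete, here is a sketch of the chunked read in pure Python. It is not the attached test script, and the file layout is an assumption for illustration: each header is taken to be a bare integer row count, and rows are whitespace-separated floats. The point is that the header and the rows are pulled from the same iterator, so the file is read once from front to back and the cost per chunk should not depend on how far into the file the chunk sits.

```python
import io
from itertools import islice


def read_chunks(fileobj):
    """Yield one list of float rows per chunk, in a single pass over the file."""
    lines = iter(fileobj)
    for header in lines:
        header = header.strip()
        if not header:
            continue  # skip blank lines between chunks
        nrows = int(header)  # assumed header format: just the row count
        # islice continues from the current position; nothing is re-read.
        chunk = [[float(v) for v in row.split()]
                 for row in islice(lines, nrows)]
        if len(chunk) != nrows:
            raise ValueError(
                "header promised %d rows, found %d" % (nrows, len(chunk))
            )
        yield chunk


sample = io.StringIO("2\n1.0 2.0\n3.0 4.0\n1\n5.0 6.0\n")
chunks = list(read_chunks(sample))
```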

Enclosed is my test code.

-Chris



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_performance.py
Type: application/octet-stream
Size: 3293 bytes
Desc: not available
Url : http://mail.scipy.org/pipermail/numpy-discussion/attachments/20120320/28bd9049/attachment.obj 

