[Numpy-discussion] Possible roadmap addendum: building better text file readers
Chris Barker
chris.barker@noaa....
Tue Mar 20 17:59:25 CDT 2012
Warren et al:
On Wed, Mar 7, 2012 at 7:49 AM, Warren Weckesser
<warren.weckesser@enthought.com> wrote:
> If you are setup with Cython to build extension modules,
I am
> and you don't mind
> testing an unreleased and experimental reader,
and I don't.
> you can try the text reader
> that I'm working on: https://github.com/WarrenWeckesser/textreader
It just took me a while to get around to it!
First of all: this is pretty much exactly what I've been looking for
for years, and never got around to writing myself - thanks!
My comments/suggestions:
1) a docstring for the textreader module would be nice.
2) "tzoffset" -- this is tricky stuff. Ideally, it should be able to
parse an ISO datetime string timezone specifier, but short of that, I
think the default should be None or UTC -- time zones are too ugly to
presume anything!
3) it breaks with the old MacOS style line endings: \r only. Maybe no
need to support that, but it turns out one of my old test files still
had them!
4) when I try to read more rows than are in the file, I get:
  File "textreader.pyx", line 247, in textreader.readrows (python/textreader.c:3469)
ValueError: negative dimensions are not allowed
good to get an error, but it's not very informative!
5) for reading float64 values -- I get something different with
textreader than with the python "float()":
input: "678.901"
float("678.901") : 678.90099999999995
textreader : 678.90100000000007
that's as close as the available precision allows, but curious...
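A quick way to see how far apart those two results are is to compare their bit patterns. This is only a sketch: it assumes IEEE 754 float64, and the helper name float_bits is made up for illustration. The two literals are the values quoted above.

```python
import struct


def float_bits(x):
    """Return the IEEE 754 bit pattern of a float64 as a signed 64-bit int."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


py_value = float("678.901")             # what Python's float() returns
tr_value = float("678.90100000000007")  # the value textreader printed

# For positive doubles, neighbouring values have consecutive bit patterns,
# so this difference counts the representable steps (ULPs) between them.
ulps_apart = float_bits(tr_value) - float_bits(py_value)
print("ULPs apart:", ulps_apart)
```

A difference of one step would mean textreader returns the neighbouring double rather than the nearest one, which is what a hand-rolled decimal-to-binary conversion (as opposed to a correctly rounded strtod) can produce. That explanation is a guess; nothing here was checked against textreader's source.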
6) Performance issue: in my case, I'm reading a big file that's in
chunks -- each one has a header indicating how many rows follow, then
the rows, so I parse it out bit by bit.
For smallish files, it's much faster than pure python, and almost as
fast as some old C code of mine that is far less flexible.
But for large files it's much slower when reading in chunks -- indeed
slower than a pure python version for my use case.
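For reference, the chunked layout described above (a header giving a row count, then that many rows) can be read in pure Python roughly like this. It is a sketch under assumptions, not the attached test code: the header is taken to be a bare integer on its own line, rows are whitespace-separated floats, and read_chunks is a hypothetical name.

```python
import io


def read_chunks(f):
    """Yield one list of float rows per chunk from an open text file.

    Assumed layout (hypothetical): a header line holding an integer row
    count, followed by that many rows of whitespace-separated floats.
    """
    while True:
        header = f.readline()
        if not header.strip():
            return  # end of file (or a blank line) ends the data
        nrows = int(header)
        rows = []
        for i in range(nrows):
            line = f.readline()
            if not line:
                raise ValueError(
                    "chunk header promised %d rows, file ended after %d"
                    % (nrows, i)
                )
            rows.append([float(v) for v in line.split()])
        yield rows


sample = io.StringIO("2\n1.0 2.0\n3.0 4.0\n1\n5.0 6.0\n")
chunks = list(read_chunks(sample))
print(chunks)
```

The file object keeps its position between chunks, so each chunk costs only the lines it actually reads.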
I did a simplified test -- with 10,000 rows:
total number of rows: 10000
pure python took: 1.410408 seconds
pure python chunks took: 1.613094 seconds
textreader all at once took: 0.067098 seconds
textreader in chunks took : 0.131802 seconds
but with 1,000,000 rows:
total number of rows: 1000000
total number of chunks: 1000
pure python took: 30.712564 seconds
pure python chunks took: 31.313225 seconds
textreader all at once took: 1.314924 seconds
textreader in chunks took : 9.684819 seconds
then it gets even worse with a smaller chunk size:
total number of rows: 1000000
total number of chunks: 10000
pure python took: 30.032246 seconds
pure python chunks took: 42.010589 seconds
textreader all at once took: 1.318613 seconds
textreader in chunks took : 87.743729 seconds
my code, which is C that basically runs fscanf over the file, has
essentially no performance hit from doing it in chunks -- so I think
something is wrong here.
Sorry, I haven't dug into the code to try to figure out what yet --
does it rewind the file each time maybe?
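One way to read the two 1,000,000-row runs above is to divide the extra time of the chunked read by the number of chunks. The arithmetic below uses only the timings quoted in this message; treating the difference as per-call overhead is an assumption, not something verified in textreader's code.

```python
# Timings in seconds, copied from the two 1,000,000-row runs above.
runs = [
    {"chunks": 1000, "all_at_once": 1.314924, "in_chunks": 9.684819},
    {"chunks": 10000, "all_at_once": 1.318613, "in_chunks": 87.743729},
]


def per_chunk_overhead_ms(run):
    """Extra time of the chunked read, spread evenly over the chunks (ms)."""
    extra = run["in_chunks"] - run["all_at_once"]
    return 1000.0 * extra / run["chunks"]


overheads = [per_chunk_overhead_ms(r) for r in runs]
for run, ms in zip(runs, overheads):
    print("%6d chunks: %.2f ms extra per chunk" % (run["chunks"], ms))
```

Both runs come out at roughly 8.4 to 8.6 ms per chunk even though the chunk size differs by a factor of ten, which would be consistent with a fixed cost per call rather than a cost per row.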
Enclosed is my test code.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_performance.py
Type: application/octet-stream
Size: 3293 bytes
Desc: not available
Url : http://mail.scipy.org/pipermail/numpy-discussion/attachments/20120320/28bd9049/attachment.obj