[Numpy-discussion] Possible roadmap addendum: building better text file readers
Wed Mar 21 00:41:25 CDT 2012
On Tue, Mar 20, 2012 at 5:59 PM, Chris Barker <firstname.lastname@example.org> wrote:
> Warren et al:
> On Wed, Mar 7, 2012 at 7:49 AM, Warren Weckesser
> <email@example.com> wrote:
> > If you are setup with Cython to build extension modules,
> I am
> > and you don't mind
> > testing an unreleased and experimental reader,
> and I don't.
> > you can try the text reader
> > that I'm working on: https://github.com/WarrenWeckesser/textreader
> It just took me a while to get around to it!
> First of all: this is pretty much exactly what I've been looking for
> for years, and never got around to writing myself - thanks!
> My comments/suggestions:
> 1) a docstring for the textreader module would be nice.
> 2) "tzoffset" -- this is tricky stuff. Ideally, it should be able to
> parse an ISO datetime string timezone specifier, but short of that, I
> think the default should be None or UTC -- time zones are too ugly to
> presume anything!
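
For what it's worth, a rough sketch of the kind of ISO 8601 offset handling
described above, using only the standard library (the function name and
formats are just for illustration):

    from datetime import datetime, timezone

    def parse_iso_datetime(s, default_tz=timezone.utc):
        # "%z" accepts offsets such as "-0700"; if the string carries no
        # offset, fall back to a conservative default (UTC here) rather
        # than guessing a local time zone.
        for fmt in ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S"):
            try:
                dt = datetime.strptime(s, fmt)
            except ValueError:
                continue
            if dt.tzinfo is None:
                dt = dt.replace(tzinfo=default_tz)
            return dt
        raise ValueError("unrecognized datetime: %r" % s)

    # parse_iso_datetime("2012-03-20T17:59:00-0700")  -> offset-aware
    # parse_iso_datetime("2012-03-20T17:59:00")       -> assumed UTC
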
> 3) it breaks with the old MacOS style line endings: \r only. Maybe no
> need to support that, but it turns out one of my old test files still
> had them!
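
One cheap workaround for \r-only files is to normalize the endings before
handing the text to the reader; a sketch (file names here are made up, and
splitlines() recognizes \r, \n and \r\n):

    def normalize_newlines(path):
        # Rewrite old Mac-style (\r-only) line endings as \n so a reader
        # that only expects \n or \r\n can cope.
        with open(path, "rb") as f:
            data = f.read()
        return b"\n".join(data.splitlines()) + b"\n"

    # open("fixed.txt", "wb").write(normalize_newlines("old_mac_file.txt"))
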
> 4) when I try to read more rows than are in the file, I get:
> File "textreader.pyx", line 247, in textreader.readrows
> ValueError: negative dimensions are not allowed
> good to get an error, but it's not very informative!
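
Presumably the negative dimension comes from sizing the output array off a
row count that goes negative once the file runs out early. A small guard
along these lines (purely a sketch, not textreader's actual code) would give
a friendlier message:

    def check_row_request(rows_requested, rows_available):
        # Raise an informative error (or clamp, if that is preferred)
        # instead of letting a negative size reach array allocation.
        if rows_requested > rows_available:
            raise ValueError(
                "requested %d rows but only %d remain in the file"
                % (rows_requested, rows_available)
            )
        return rows_requested
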
> 5) for reading float64 values -- I get something different with
> textreader than with the python "float()":
> input: "678.901"
> float("") : 678.90099999999995
> textreader : 678.90100000000007
> as close as the number of figures available, but curious...
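
The two values above differ by one unit in the last place of a float64,
which is about as small as a parsing discrepancy can get; a quick way to see
the scale of it (a sketch):

    import numpy as np
    from decimal import Decimal

    x = float("678.901")
    print(Decimal(x))      # the exact binary value behind Python's parse
    print(np.spacing(x))   # spacing of adjacent float64 values near 678.901,
                           # roughly 1.1e-13 -- about the size of the
                           # difference between the two results above
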
> 6) Performance issue: in my case, I'm reading a big file that's in
> chunks -- each one has a header indicating how many rows follow, then
> the rows, so I parse it out bit by bit.
> For smallish files, it's much faster than pure python, and almost as
> fast as some old C code of mine that is far less flexible.
> But for large files, -- it's much slower -- indeed slower than a pure
> python version for my use case.
> I did a simplified test -- with 10,000 rows:
> total number of rows: 10000
> pure python took: 1.410408 seconds
> pure python chunks took: 1.613094 seconds
> textreader all at once took: 0.067098 seconds
> textreader in chunks took : 0.131802 seconds
> but with 1,000,000 rows:
> total number of rows: 1000000
> total number of chunks: 1000
> pure python took: 30.712564 seconds
> pure python chunks took: 31.313225 seconds
> textreader all at once took: 1.314924 seconds
> textreader in chunks took : 9.684819 seconds
> then it gets even worse with the chunk size smaller:
> total number of rows: 1000000
> total number of chunks: 10000
> pure python took: 30.032246 seconds
> pure python chunks took: 42.010589 seconds
> textreader all at once took: 1.318613 seconds
> textreader in chunks took : 87.743729 seconds
> my code, which is C that essentially runs fscanf over the file, shows
> no performance hit from doing it in chunks -- so I think
> something is wrong here.
> Sorry, I haven't dug into the code to try to figure out what yet --
> does it rewind the file each time maybe?
> Enclosed is my test code.
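
The attached test script isn't included in the archive, but the pure-Python
chunked reader being timed presumably looks something like this (a sketch
only; the file name, comma delimiter, and chunk layout are guesses from the
description above):

    import time

    def read_chunks_pure_python(path):
        # Each chunk starts with a header line giving the number of data
        # rows that follow; parse those rows with split()/float().
        chunks = []
        with open(path) as f:
            while True:
                header = f.readline()
                if not header:
                    break
                nrows = int(header)
                rows = [[float(v) for v in f.readline().split(",")]
                        for _ in range(nrows)]
                chunks.append(rows)
        return chunks

    start = time.time()
    chunks = read_chunks_pure_python("data.txt")   # hypothetical file name
    print("pure python chunks took: %f seconds" % (time.time() - start))

If the Cython reader re-scans or re-seeks the already-consumed part of the
file on every call, the cost of each chunk would grow with the file
position, which would explain why the slowdown gets dramatically worse as
the chunks get smaller.
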
Thanks! The feedback is great. I won't have time to get back to this for
another week or so, but then I'll look into the issues you reported.
> Christopher Barker, Ph.D.
> Emergency Response Division
> NOAA/NOS/OR&R (206) 526-6959 voice
> 7600 Sand Point Way NE (206) 526-6329 fax
> Seattle, WA 98115 (206) 526-6317 main reception