[Numpy-discussion] Possible roadmap addendum: building better text file readers
Wed Feb 29 09:11:51 CST 2012
Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
> > Even for binary, there are pathological cases, e.g. 1) reading a random
> > subset of nearly all rows. 2) reading a single column when rows are
> > small. In case 2 you will only go this route in the first place if you
> > need to save memory. The user should be aware of these issues.
> FWIW, this route actually doesn't save any memory as compared to np.memmap.
Actually, for numpy.memmap you will read the whole file if you try to
grab a single column and read a large fraction of the rows. Here is an
example that will end up pulling the entire file into memory
I just tested this on a 3 GB binary file and I'm sitting at 3 GB of memory
usage. I believe this is because numpy.memmap only understands rows. I
don't fully understand the reason for that, but I suspect it is related
to the fact that the ndarray really only has a concept of itemsize, and
the fields are really just a reinterpretation of those bytes. It may be
that one could tweak the ndarray code to get around this. But I would
appreciate enlightenment on this subject.
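
To illustrate the point about fields and itemsize, here is a minimal sketch. It is not the code from my test: the file name, the record layout, and the row count are made up for the illustration. It shows that a single field of a memmapped structured array is a strided view whose stride is the full record size, so the bytes of one column are interleaved with every other column on disk.

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout: three 8-byte fields, so 24 bytes per row.
dtype = np.dtype([("id", "i8"), ("x", "f8"), ("y", "f8")])
nrows = 1000

# Write a small demo file (the name and location are arbitrary).
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
data = np.zeros(nrows, dtype=dtype)
data["id"] = np.arange(nrows)
data["x"] = np.arange(nrows) * 0.5
data.tofile(path)

# Map the file as an array of records.
mm = np.memmap(path, dtype=dtype, mode="r")

# Selecting one field gives a view, not a packed column: consecutive
# elements are one full record apart in the file.
xview = mm["x"]
column_stride = xview.strides[0]    # bytes between consecutive x values
field_size = xview.dtype.itemsize   # bytes actually wanted per row

# Grab a single column for a large fraction of the rows and copy it
# into ordinary memory.
rows = np.arange(0, nrows, 2)
x_subset = np.array(mm["x"][rows])

print(column_stride, field_size, x_subset.shape)
```

Because the wanted 8-byte values sit 24 bytes apart, reading many of them touches nearly every page of the mapping. How much of that stays resident depends on the operating system's page cache, so the memory figure one sees will vary by platform.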
This fact was the original motivator for writing my code; the text
reading ability came later.
> Cool. I'm just a little concerned that, since we seem to have like...
> 5 different implementations of this stuff all being worked on at the
> same time, we need to get some consensus on which features actually
> matter, so they can be melded together into the Single Best File
> Reader Evar. An interface where indexing and file-reading are combined
> is significantly more complicated than one where the core file-reading
> inner-loop can ignore indexing. So far I'm not sure why this
> complexity would be worthwhile, so that's what I'm trying to
I think I've addressed the reason why the low-level C code was written.
And I think a unified, high-level interface to binary and text files,
which the Recfile class provides, is worthwhile.
Can you please say more about "...one where the core file-reading
inner-loop can ignore indexing"? I didn't catch the meaning.
> -- Nathaniel
> > Also, for some crazy ascii files we may want to revert to pure python
> > anyway, but I think these should be special cases that can be flagged
> > at runtime through keyword arguments to the python functions.
> > BTW, did you mean to go off-list?
> > cheers,
> > -e
> > --
> > Erin Scott Sheldon
> > Brookhaven National Laboratory
Erin Scott Sheldon
Brookhaven National Laboratory