[Numpy-discussion] Possible roadmap addendum: building better text file readers

Erin Sheldon erin.sheldon@gmail....
Mon Feb 27 08:44:52 CST 2012

Excerpts from Jay Bourque's message of Mon Feb 27 00:24:25 -0500 2012:
> Hi Erin,
> I'm the one Travis mentioned earlier about working on this. I was planning on 
> diving into it this week, but it sounds like you may have some code already that 
> fits the requirements? If so, I would be available to help you with 
> porting/testing your code with numpy, or I can take what you have and build on 
> it in my numpy fork on github.

Hi Jay, all -

What I've got is a solution for writing and reading structured arrays to
and from files, both text and binary.  It is written in C and Python.  It
allows reading arbitrary subsets of the data efficiently without reading
in the whole file.  It defines a class, Recfile, that exposes an
array-like interface for reading, e.g. x=rf[columns][rows].
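
To make the rf[columns][rows] access pattern concrete, here is a minimal
sketch for binary files of fixed-size records.  It is not the Recfile code
itself: the class names RecfileSketch and _Columns are hypothetical, and
numpy.memmap stands in for the C reader so that only the parts of the file
that are actually indexed get paged in.

```python
import numpy as np


class _Columns:
    """Holds a column selection until rows are requested."""

    def __init__(self, data, columns):
        self._data = data
        self._columns = columns

    def __getitem__(self, rows):
        # Pick the rows first, then the requested field(s), and return
        # an ordinary in-memory ndarray.
        picked = self._data[rows]
        return np.array(picked[self._columns])


class RecfileSketch:
    """Hypothetical stand-in for a Recfile-like reader (binary only)."""

    def __init__(self, filename, dtype):
        # Map the file as an array of records; nothing is read yet.
        self._data = np.memmap(filename, dtype=np.dtype(dtype), mode="r")

    def __len__(self):
        return self._data.shape[0]

    def __getitem__(self, columns):
        # columns is a field name or a list of field names.
        return _Columns(self._data, columns)
```

With a structured dtype such as [("id", "i4"), ("x", "f8")], a call like
rf[["id", "x"]][[1, 3]] would return just those two fields for records 1
and 3.  Text files would need a tokenizing reader instead of a memory map.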

Limitations: Because it was designed with arrays in mind, it doesn't
deal with variable-width string fields.  It also doesn't deal with
quoted strings, as those are not necessary for writing or reading arrays
with fixed-length strings, and it doesn't deal with missing data.  This is
where Wes' tokenizing-oriented code might be useful.  So there is a fair
amount of functionality to be added for edge cases, but it provides a
framework.  I think some of this can be written into the C code; the rest
will have to be done at the Python level.

I've forked numpy on my github account, and should have the code added
in a few days.  I'll send mail when it is ready.  Help will be greatly
appreciated in getting this to work with loadtxt, adding functionality
from Wes' and others' code, and testing.

Also, because it works on binary files too, I think it might be worth it
to make numpy.fromfile a Python function, and to use a Recfile object
when reading subsets of the data.  For example, numpy.fromfile(f,
rows=rows, columns=columns, dtype=dtype) could instantiate a Recfile
object to read the column and row subsets.  We could rename the C
fromfile to something appropriate, and call it when the whole file is
being read (recfile uses it internally when reading ranges).
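
The dispatch described above could look roughly like the sketch below.
The function name fromfile_sketch is hypothetical, the rows and columns
keywords are the proposed ones and are not part of the existing
numpy.fromfile signature, and numpy.memmap again stands in for a Recfile
object on the subset path.

```python
import numpy as np


def fromfile_sketch(filename, dtype, rows=None, columns=None):
    """Python-level wrapper: whole-file reads go to the existing routine,
    subset reads go to a subset-capable reader (assumed: np.memmap)."""
    dtype = np.dtype(dtype)

    if rows is None and columns is None:
        # Whole file requested: call the existing fromfile directly.
        return np.fromfile(filename, dtype=dtype)

    # Subset requested: map the file and index into it.
    data = np.memmap(filename, dtype=dtype, mode="r")
    if rows is not None:
        data = data[rows]
    if columns is not None:
        data = data[columns]

    # Copy the selection into an ordinary in-memory ndarray.
    return np.array(data)
```

The point of the split is that callers keep one entry point, while the
fast whole-file path stays unchanged.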

Erin Scott Sheldon
Brookhaven National Laboratory
