[Numpy-discussion] Possible roadmap addendum: building better text file readers

Chris Barker chris.barker@noaa....
Tue Mar 6 16:45:34 CST 2012

On Thu, Mar 1, 2012 at 10:58 PM, Jay Bourque <jayvius@gmail.com> wrote:

> 1. Loading text files using loadtxt/genfromtxt needs a significant
> performance boost (I think at least an order of magnitude increase in
> performance is very doable based on what I've seen with Erin's recfile code)

> 2. Improved memory usage. Memory used for reading in a text file shouldn’t
> be more than the file itself, and less if only reading a subset of file.

> 3. Keep existing interfaces for reading text files (loadtxt, genfromtxt,
> etc). No new ones.

> 4. Underlying code should keep IO iteration and transformation of data
> separate (awaiting more thoughts from Travis on this).

> 5. Be able to plug in different transformations of data at low level (also
> awaiting more thoughts from Travis).

> 6. memory mapping of text files?

> 7. Eventually reduce memory usage even more by using same object for
> duplicate values in array (depends on implementing enum dtype?)

> Anything else?

Yes -- I'd like to see the solution be able to do high-performance
reads of a portion of a file, not always the whole thing. I seem to
have a number of custom text files that I need to read that are laid
out in chunks: a bit of a header, then a block of numbers, another
header, another block. I'm happy to read and parse the header sections
with pure Python, but would love a way to read the blocks of numbers
into a numpy array fast. This will probably come out of the box with
any of the proposed solutions, as long as they start at the current
position of a passed-in file object, can be told how much to read,
and then leave the file pointer in the correct position.
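
To make the chunked-file use case concrete, here is a rough sketch using only what numpy offers today; it is not any of the proposed implementations. The header format ("name nrows") and the helper names read_block and read_chunked are made up for illustration. The point is the contract: the block reader starts at the current position of the file object, consumes exactly the requested number of lines, and leaves the file positioned at the next header.

```python
import io
from itertools import islice

import numpy as np


def read_block(fileobj, nrows):
    """Read the next `nrows` lines from `fileobj` into a 2-D float array.

    Hypothetical helper: exactly `nrows` lines are consumed, so the file
    object is left positioned at the start of whatever follows the block.
    """
    lines = list(islice(fileobj, nrows))
    if len(lines) < nrows:
        raise ValueError("file ended before the block was complete")
    # np.loadtxt accepts a list of strings as well as a file name or object.
    return np.loadtxt(lines, ndmin=2)


def read_chunked(fileobj):
    """Parse a file laid out as header, block, header, block, ...

    Assumed (made-up) header format: "<name> <number of rows>".
    Headers are parsed in pure Python; number blocks go through numpy.
    """
    blocks = {}
    while True:
        header = fileobj.readline()
        if not header:          # end of file
            break
        if not header.strip():  # skip blank lines between chunks
            continue
        name, nrows = header.split()
        blocks[name] = read_block(fileobj, int(nrows))
    return blocks


# A small in-memory stand-in for a real chunked text file.
sample = io.StringIO(
    "temps 2\n"
    "1.0 2.0 3.0\n"
    "4.0 5.0 6.0\n"
    "depths 1\n"
    "7.5 8.5\n"
)
blocks = read_chunked(sample)
```

This still pays the current loadtxt parsing cost for every block; a faster reader that honours the same start-here, read-this-much, stop-there contract could replace the np.loadtxt call without touching the header-parsing loop.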

Great to see this moving forward.



Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

