[Numpy-discussion] Possible roadmap addendum: building better text file readers
Wed Mar 7 09:49:47 CST 2012
On Tue, Mar 6, 2012 at 4:45 PM, Chris Barker <email@example.com> wrote:
> On Thu, Mar 1, 2012 at 10:58 PM, Jay Bourque <firstname.lastname@example.org> wrote:
> > 1. Loading text files using loadtxt/genfromtxt needs a significant
> > performance boost (I think at least an order of magnitude increase in
> > performance is very doable based on what I've seen with Erin's recfile).
> > 2. Improved memory usage. Memory used for reading in a text file should not
> > be more than the file itself, and less if only reading a subset of the file.
> > 3. Keep existing interfaces for reading text files (loadtxt, genfromtxt,
> > etc). No new ones.
> > 4. Underlying code should keep IO iteration and transformation of data
> > separate (awaiting more thoughts from Travis on this).
> > 5. Be able to plug in different transformations of data at a low level
> > (also awaiting more thoughts from Travis).
> > 6. Memory mapping of text files?
> > 7. Eventually reduce memory usage even more by using the same object for
> > duplicate values in an array (depends on implementing an enum dtype?)
> > Anything else?
> Yes -- I'd like to see the solution be able to do high-performance
> reads of a portion of a file -- not always the whole thing. I seem to
> have a number of custom text files that I need to read that are laid
> out in chunks: a bit of a header, then a block of numbers, another
> header, another block. I'm happy to read and parse the header sections
> with pure Python, but would love a way to read the blocks of numbers
> into a numpy array fast. This will probably come out of the box with
> any of the proposed solutions, as long as they start at the current
> position of a passed-in file object, can be told how much to read,
> and then leave the file pointer in the correct position.
If you are set up with Cython to build extension modules, and you don't mind
testing an unreleased and experimental reader, you can try the text reader
that I'm working on: https://github.com/WarrenWeckesser/textreader
You can read a file like this, where the first line gives the number of
rows of the following array, and that pattern repeats:
5
1.0, 2.0, 3.0
4.0, 5.0, 6.0
7.0, 8.0, 9.0
10.0, 11.0, 12.0
13.0, 14.0, 15.0
3
1.0, 1.5, 2.0, 2.5
3.0, 3.5, 4.0, 4.5
5.0, 5.5, 6.0, 6.5
1
1.0D2, 1.25D-1, 6.25D-2, 99
with code like this:
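import numpy as np
from textreader import readrows

filename = 'data/multi.dat'
with open(filename, 'r') as f:
    # Each block starts with a line giving its number of rows.
    line = f.readline()
    while len(line) > 0:
        nrows = int(line)
        # Read nrows rows from the current file position; sci='D' is
        # there for exponents written like 1.0D2.
        a = readrows(f, np.float32, numrows=nrows, sci='D', delimiter=',')
        line = f.readline()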
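
If you would rather not build the extension, roughly the same loop can be sketched with plain NumPy. This is only a sketch under some assumptions: `read_blocks` is a hypothetical helper name, the sample text is made up in the same layout as the file above, and replacing 'D' with 'E' is only safe because the rows contain nothing but numbers. It hands each block to np.loadtxt as a list of lines, so it will not be as fast as a compiled reader.

```python
import io

import numpy as np


def read_blocks(f, dtype=np.float32, delimiter=','):
    """Read every (row count, rows) block from an open text file object.

    Hypothetical helper: starts at the current position of f and returns
    a list with one 2-D array per block.
    """
    blocks = []
    while True:
        header = f.readline()
        if not header.strip():
            # End of file (or a blank line): no more blocks.
            break
        nrows = int(header)
        # Take exactly nrows lines so the file pointer ends up at the
        # next header. Rewrite 'D' exponents as 'E' so that loadtxt can
        # parse them (assumes purely numeric rows).
        lines = [f.readline().replace('D', 'E') for _ in range(nrows)]
        blocks.append(
            np.loadtxt(lines, dtype=dtype, delimiter=delimiter, ndmin=2)
        )
    return blocks


# Made-up sample in the same layout: a row count, then that many rows.
sample = """2
1.0, 2.0, 3.0
4.0, 5.0, 6.0
1
1.0D2, 1.25D-1, 6.25D-2, 99
"""

blocks = read_blocks(io.StringIO(sample))
for a in blocks:
    print(a.shape)
```

The ndmin=2 argument keeps a one-row block two-dimensional, so every block comes back with shape (nrows, ncols).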