[Numpy-discussion] Possible roadmap addendum: building better text file readers
Tue Feb 28 17:36:58 CST 2012
Hi All -
I've added the relevant code to my numpy fork here.
The Python module and C file are at /numpy/lib/recfile.py and
/numpy/lib/src/_recfile.c, and access from Python is via numpy.recfile.
See below for the docstring for the main class, Recfile; some example
usage is shown there. As listed in the limitations section below, quoted
strings are not yet supported for text files. This could be addressed by
optionally using smarter code when reading strings from these types
of files. I'd greatly appreciate some help with that aspect.
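For anyone interested in helping with that: the quoted-string case is
essentially a quote-aware tokenizing problem. As a rough illustration of
the splitting behavior needed (this is not code from the fork; Python's
csv module is used here only to show the expected result, and
split_quoted_line is a hypothetical name):

```python
import csv
import io

def split_quoted_line(line, delim=","):
    # Split one delimited line, honoring double-quoted fields so the
    # delimiter (or a newline) can appear inside a string field.
    # The csv module already implements the necessary state machine.
    return next(csv.reader(io.StringIO(line), delimiter=delim, quotechar='"'))

fields = split_quoted_line('1,"hello, world",3.5')
# fields == ['1', 'hello, world', '3.5']
```

Note the quotes are stripped from the result, which is exactly what the
current text reader does not do.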
There is a test suite in numpy.recfile.test()
A class for reading and writing structured arrays to and from files.
Both binary and text files are supported. Any subset of the data can be
read without loading the whole file. See the limitations section below
for details.

Parameters
----------
fobj: file or string
    A string or file object.
mode: string, optional
    Mode for opening when fobj is a string.
dtype: numpy dtype or descriptor
    A numpy dtype or descriptor describing each line of the file. The
    dtype must contain fields. This is a required parameter; it is a
    keyword only for clarity.

    Note that for text files the dtype will be converted to native byte
    ordering. Any data written to the file must also be in the native
    byte ordering.
nrows: int, optional
    Number of rows in the file. If not entered, the rows will be counted
    from the file itself. This is a simple calculation for binary files,
    but can be slow for text files.
delim: string, optional
    The delimiter for text files. If None or "" the file is assumed to
    be binary. Should be a single character.
skipheader: int, optional
    Skip this many lines in the header.
offset: int, optional
    Move to this offset in the file. Reads will all be relative to this
    location. If not sent, it is taken from the current position in the
    input file object, or 0 if a filename was entered.
string_newlines: bool, optional
    If True, strings in text files may contain newlines. This is only
    relevant for text files when the nrows= keyword is not sent, because
    the number of lines must be counted. In this case the full text
    reading code is used to count rows instead of a simple newline
    count. Because the text is fully processed twice, this can double
    the time to read files.
padnull: bool, optional
    If True, nulls in strings are replaced with spaces when writing text
    files.
ignorenull: bool, optional
    If True, nulls in strings are not written when writing text files.
    This results in string fields that are not fixed width, so they
    cannot be read back in using recfile.
Limitations
-----------
Currently, only fixed-width string fields are supported. String fields
can contain any characters, including newlines, but for text files
quoted strings are not currently supported: the quotes will be part of
the result. For binary files, structured sub-arrays and complex types
can be written and read, but this is not yet supported for text files.
Examples
--------
# read from binary file
rf = numpy.recfile.Recfile(fname, dtype=dtype)
# read all data using either slice or method notation
data = rf[:]
data = rf.read()
# read row slices
data = rf[8:55:3]
# read subset of columns and possibly rows
# can use either slice or method notation
data = rf[col_list][row_list]
data = rf.read(rows=row_list, columns=col_list)
# for text files, just send the delimiter string
# all the above calls will also work
rf = numpy.recfile.Recfile(fname, dtype=dtype, delim=",")
# save time for text files by sending row count
rf = numpy.recfile.Recfile(fname, dtype=dtype, delim=",", nrows=nrows)
# write some data
rf = numpy.recfile.Recfile(fname, mode="w", dtype=dtype, delim=",")
rf.write(data)
# append some data
rf.write(more_data)
# print metadata about the file
print(rf)
Recfile nrows: 345472 ncols: 6 mode: 'w'
  arr <f4 array[2,2]
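For reference, the "simple calculation" that makes binary row counting
and row subsets cheap is just seeking by row * itemsize. A minimal
Python sketch of the idea (an illustration only, not the actual
_recfile.c implementation; read_binary_rows is a hypothetical name):

```python
import numpy as np

def read_binary_rows(fname, dtype, rows, offset=0):
    # Read an arbitrary subset of rows from a fixed-width binary file
    # by seeking directly to offset + row * itemsize for each row.
    # (Hypothetical sketch; _recfile.c does the equivalent in C.)
    dtype = np.dtype(dtype)
    out = np.empty(len(rows), dtype=dtype)
    with open(fname, "rb") as f:
        for i, row in enumerate(rows):
            f.seek(offset + row * dtype.itemsize)
            out[i] = np.fromfile(f, dtype=dtype, count=1)[0]
    return out
```

For text files no such arithmetic is possible, which is why rows must be
counted by scanning the file.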
Excerpts from Erin Sheldon's message of Mon Feb 27 09:44:52 -0500 2012:
> Excerpts from Jay Bourque's message of Mon Feb 27 00:24:25 -0500 2012:
> > Hi Erin,
> > I'm the one Travis mentioned earlier about working on this. I was planning on
> > diving into it this week, but it sounds like you may have some code already that
> > fits the requirements? If so, I would be available to help you with
> > porting/testing your code with numpy, or I can take what you have and build on
> > it in my numpy fork on github.
> Hi Jay,all -
> What I've got is a solution for writing and reading structured arrays to
> and from files, both in text files and binary files. It is written in C
> and python. It allows reading arbitrary subsets of the data efficiently
> without reading in the whole file. It defines a class Recfile that
> exposes an array like interface for reading, e.g. x=rf[columns][rows].
> Limitations: Because it was designed with arrays in mind, it doesn't
> deal with non-fixed-width string fields. Also, it doesn't deal with
> quoted strings, as those are not necessary for writing or reading arrays
> with fixed length strings. Doesn't deal with missing data. This is
> where Wes' tokenizing-oriented code might be useful. So there is a fair
> amount of functionality to be added for edge cases, but it provides a
> framework. I think some of this can be written into the C code, others
> will have to be done at the python level.
> I've forked numpy on my github account, and should have the code added
> in a few days. I'll send mail when it is ready. Help will be greatly
> appreciated getting this to work with loadtxt, adding functionality from
> Wes's and others' code, and testing.
> Also, because it works on binary files too, I think it might be worth it
> to make numpy.fromfile a python function, and to use a Recfile object
> when reading subsets of the data. For example numpy.fromfile(f,
> rows=rows, columns=columns, dtype=dtype) could instantiate a Recfile
> object to read the column and row subsets. We could rename the C
> fromfile to something appropriate, and call it when the whole file is
> being read (recfile uses it internally when reading ranges).
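The proposed dispatch could look something like the sketch below.
Everything here is hypothetical (there is no Recfile-backed fromfile in
numpy); a tiny in-memory stand-in plays the role of Recfile purely to
show the rows/columns routing:

```python
import numpy as np

class _RecfileStub:
    # Tiny in-memory stand-in for the proposed Recfile class, exposing
    # the x = rf[columns][rows] style interface described above.
    # (Hypothetical -- the real class would read from a file.)
    def __init__(self, arr):
        self._arr = arr
    def __getitem__(self, key):
        sub = self._arr[key]
        # selecting columns returns another Recfile-like view;
        # selecting rows returns an actual array
        if isinstance(key, (str, list)):
            return _RecfileStub(sub)
        return sub

def fromfile(arr, rows=None, columns=None):
    # Sketch of the proposed python-level numpy.fromfile dispatch:
    # whole-file reads would go straight to the (renamed) C fromfile,
    # while row/column subsets go through a Recfile object.
    if rows is None and columns is None:
        return arr  # stands in for the fast C whole-file read
    rf = _RecfileStub(arr)
    if columns is not None:
        rf = rf[columns]
    if rows is not None:
        return rf[rows]
    return rf[slice(None)]
```

The real version would pass the file and dtype through to Recfile
rather than wrapping an in-memory array.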
Erin Scott Sheldon
Brookhaven National Laboratory