[Numpy-discussion] Possible roadmap addendum: building better text file readers

Erin Sheldon erin.sheldon@gmail....
Tue Feb 28 17:36:58 CST 2012


Hi All -

I've added the relevant code to my numpy fork here

    https://github.com/esheldon/numpy

The Python module and C file are at /numpy/lib/recfile.py and
/numpy/lib/src/_recfile.c.  Access from Python is via numpy.recfile.

See below for the doc string for the main class, Recfile.  Some example
usage is shown.  As listed in the limitations section below, quoted
strings are not yet supported for text files.  This can be addressed by
optionally using some smarter code when reading strings from these types
of files.  I'd greatly appreciate some help with that aspect.
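
To illustrate the quoting problem, here is a minimal sketch using the
standard library csv module rather than recfile itself; the sample line is
made up.  A plain split on the delimiter keeps the quote characters and
breaks the field apart, while a quote-aware reader returns it whole:

```python
import csv
import io

# A made-up line with a quoted field that contains the delimiter.
line = '1,"a,b",2.5\n'

# Naive split: the comma inside the quotes is treated as a field
# separator and the quote characters stay in the result.
naive = line.rstrip('\n').split(',')

# csv.reader understands quoting: the quoted field comes back whole,
# without the surrounding quotes.
quote_aware = next(csv.reader(io.StringIO(line), delimiter=','))

print(naive)
print(quote_aware)
```

The C reader would need comparable logic to be applied optionally when
reading string fields.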

There is a test suite in numpy.recfile.test().

    A class for reading and writing structured arrays to and from files.

    Both binary and text files are supported.  Any subset of the data can be
    read without loading the whole file.  See the limitations section below for
    caveats.

    parameters
    ----------
    fobj: file or string
        A string or file object.
    mode: string
        Mode for opening when fobj is a string.
    dtype:
        A numpy dtype or descriptor describing each line of the file.  The
        dtype must contain fields. This is a required parameter; it is a
        keyword only for clarity. 

        Note for text files the dtype will be converted to native byte
        ordering.  Any data written to the file must also be in the native byte
        ordering.
    nrows: int, optional
        Number of rows in the file.  If not entered, the rows will be counted
        from the file itself. This is a simple calculation for binary files,
        but can be slow for text files.
    delim: string, optional
        The delimiter for text files.  If None or "" the file is
        assumed to be binary.  Should be a single character.
    skipheader: int, optional
        Skip this many lines in the header.
    offset: int, optional
        Move to this offset in the file.  Reads will all be relative to this
        location. If not sent, it is taken from the current position in the
        input file object or 0 if a filename was entered.

    string_newlines: bool, optional
        If true, strings in text files may contain newlines.  This is only
        relevant for text files when the nrows= keyword is not sent, because
        the number of lines must be counted.  
        
        In this case the full text reading code is used to count rows instead
        of a simple newline count.  Because the text is fully processed twice,
        this can double the time to read files.

    padnull: bool
        If True, nulls in strings are replaced with spaces when writing text.
    ignorenull: bool
        If True, nulls in strings are not written when writing text.  This
        results in string fields that are not fixed width, so they cannot be
        read back in using recfile.

    limitations
    -----------
        Currently, only fixed width string fields are supported.  String fields
        can contain any characters, including newlines, but for text files
        quoted strings are not currently supported: the quotes will be part of
        the result.  For binary files, structured sub-arrays and complex can be
        written and read, but this is not supported yet for text files.

    examples
    --------
        # read from binary file
        dtype=[('id','i4'),('x','f8'),('y','f8'),('arr','f4',(2,2))]
        rec=numpy.recfile.Recfile(fname,dtype=dtype)


        # read all data using either slice or method notation
        data=rec[:]
        data=rec.read()

        # read row slices
        data=rec[8:55:3]

        # read subset of columns and possibly rows
        # can use either slice or method notation
        data=rec['x'][:]
        data=rec['id','x'][:]
        data=rec[col_list][row_list]
        data=rec.read(columns=col_list, rows=row_list)

        # for text files, just send the delimiter string
        # all the above calls will also work
        rec=numpy.recfile.Recfile(fname,dtype=dtype,delim=',')

        # save time for text files by sending row count
        rec=numpy.recfile.Recfile(fname,dtype=dtype,delim=',',nrows=10000)

        # write some data
        rec=numpy.recfile.Recfile(fname,mode='w',dtype=dtype,delim=',')
        rec.write(data)

        # append some data
        rec.write(more_data)

        # print metadata about the file
        print rec
        Recfile  nrows: 345472 ncols: 6 mode: 'w'

          id                 <i4
          x                  <f8
          y                  <f8
          arr                <f4  array[2,2]
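
For reference, here is a rough sketch of why the row count is cheap for
binary files but not for text.  It uses plain numpy and the standard
library, not recfile; the file names and the 7-row test array are made up
for the example.  For fixed-width binary records the count is just the
file size divided by the record size, while for text every line has to be
visited:

```python
import os
import tempfile

import numpy as np

dtype = np.dtype([('id', 'i4'), ('x', 'f8'), ('y', 'f8'), ('arr', 'f4', (2, 2))])
data = np.zeros(7, dtype=dtype)
data['id'] = np.arange(7)

tmpdir = tempfile.mkdtemp()

# Binary: fixed-size records, so the count is simple arithmetic.
bin_path = os.path.join(tmpdir, 'demo.rec')
data.tofile(bin_path)
offset = 0  # bytes to skip before the first record
nrows_binary = (os.path.getsize(bin_path) - offset) // dtype.itemsize

# Text: the whole file has to be scanned to count the lines.
# Only the scalar columns are written here, to keep the example short.
txt_path = os.path.join(tmpdir, 'demo.csv')
with open(txt_path, 'w') as f:
    for row in data:
        f.write('%d,%g,%g\n' % (row['id'], row['x'], row['y']))
with open(txt_path) as f:
    nrows_text = sum(1 for _ in f)

print(nrows_binary, nrows_text)
```

A plain newline count like this is only valid when strings cannot contain
newlines, which is the case the string_newlines keyword is meant to cover.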

Excerpts from Erin Sheldon's message of Mon Feb 27 09:44:52 -0500 2012:
> Excerpts from Jay Bourque's message of Mon Feb 27 00:24:25 -0500 2012:
> > Hi Erin,
> > 
> > I'm the one Travis mentioned earlier about working on this. I was planning on 
> > diving into it this week, but it sounds like you may have some code already that 
> > fits the requirements? If so, I would be available to help you with 
> > porting/testing your code with numpy, or I can take what you have and build on 
> > it in my numpy fork on github.
> 
> Hi Jay, all -
> 
> What I've got is a solution for writing and reading structured arrays to
> and from files, both in text files and binary files.  It is written in C
> and python.  It allows reading arbitrary subsets of the data efficiently
> without reading in the whole file.  It defines a class Recfile that
> exposes an array like interface for reading, e.g. x=rf[columns][rows].
> 
> Limitations: Because it was designed with arrays in mind, it doesn't
> deal with variable-width string fields.  Also, it doesn't deal with
> quoted strings, as those are not necessary for writing or reading arrays
> with fixed length strings.  Doesn't deal with missing data.  This is
> where Wes' tokenizing-oriented code might be useful.  So there is a fair
> amount of functionality to be added for edge cases, but it provides a
> framework.  I think some of this can be written into the C code, others
> will have to be done at the python level.
> 
> I've forked numpy on my github account, and should have the code added
> in a few days.  I'll send mail when it is ready.  Help will be greatly
> appreciated getting this to work with loadtxt, adding functionality from
> Wes' and others' code, and testing.
> 
> Also, because it works on binary files too, I think it might be worth it
> to make numpy.fromfile a python function, and to use a Recfile object
> when reading subsets of the data. For example, numpy.fromfile(f,
> rows=rows, columns=columns, dtype=dtype) could instantiate a Recfile
> object to read the column and row subsets.  We could rename the C
> fromfile to something appropriate, and call it when the whole file is
> being read (recfile uses it internally when reading ranges).
> 
> thanks,
> -e
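
As a rough illustration of the row/column subset idea for binary files,
here is a sketch built on numpy.memmap.  The function read_subset is
hypothetical: it is neither the proposed numpy.fromfile signature nor how
Recfile is implemented, and the test file is made up.

```python
import os
import tempfile

import numpy as np


def read_subset(fname, dtype, rows=None, columns=None, offset=0):
    """Hypothetical helper: read selected rows/columns of a binary file.

    rows may be a slice or a list of row indices; columns may be a field
    name or a list of field names.
    """
    dtype = np.dtype(dtype)
    # Map the file instead of reading it; shape is inferred from its size.
    mm = np.memmap(fname, dtype=dtype, mode='r', offset=offset)
    sub = mm[slice(None) if rows is None else rows]
    if columns is not None:
        sub = sub[columns]
    # Copy the selection into an ordinary in-memory array.
    return np.array(sub)


# Write a small made-up test file.
dtype = [('id', 'i4'), ('x', 'f8'), ('y', 'f8')]
data = np.zeros(10, dtype=dtype)
data['id'] = np.arange(10)
data['x'] = np.arange(10) * 0.5
fname = os.path.join(tempfile.mkdtemp(), 'demo.rec')
data.tofile(fname)

xs = read_subset(fname, dtype, rows=[1, 3], columns='x')
sub = read_subset(fname, dtype, rows=slice(0, 6, 2), columns=['id', 'x'])
print(xs)
print(sub['id'])
```

This only covers the binary case; text files still need a real parser.
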
-- 
Erin Scott Sheldon
Brookhaven National Laboratory

