[Numpy-discussion] Possible roadmap addendum: building better text file readers

Erin Sheldon erin.sheldon@gmail....
Thu Feb 23 14:33:54 CST 2012


Excerpts from Wes McKinney's message of Thu Feb 23 15:24:44 -0500 2012:
> On Thu, Feb 23, 2012 at 3:23 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
> > I designed the recfile package to fill this need.  It might be a start.
> Can you relicense as BSD-compatible?

If required, that would be fine with me.
-e

> 
> > Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:
> >> dear all,
> >>
> >> I haven't read all 180 e-mails, but I didn't see this on Travis's
> >> initial list.
> >>
> >> All of the existing flat file reading solutions I have seen are
> >> not suitable for many applications, and they compare very unfavorably
> >> to tools present in other languages, like R. Here are some of the
> >> main issues I see:
> >>
> >> - Memory usage: creating millions of Python objects when reading
> >>   a large file results in horrendously bad memory utilization,
> >>   which the Python interpreter is loath to return to the
> >>   operating system. Any solution using the CSV module (like
> >>   pandas's parsers-- which are a lot faster than anything else I
> >>   know of in Python) suffers from this problem because the data
> >>   come out boxed in tuples of PyObjects. Try loading a 1,000,000
> >>   x 20 CSV file into a structured array using np.genfromtxt or
> >>   into a DataFrame using pandas.read_csv and you will immediately
> >>   see the problem (first sketch below). R, by contrast, uses
> >>   very little memory.
> >>
> >> - Performance: post-processing of Python objects results in poor
> >>   performance. Also, for the actual parsing, anything regular
> >>   expression based (like the loadtable effort over the summer,
> >>   all apologies to those who worked on it) is doomed to
> >>   failure (second sketch below). I think a tool with a high
> >>   degree of flexibility and intelligence for parsing unruly
> >>   small files makes sense, but it's not appropriate for
> >>   large, well-behaved files.
> >>
> >> - Need to "factorize": as soon as there is an enum dtype in
> >>   NumPy, we will want the file parsers for structured arrays
> >>   and DataFrame to be able to "factorize" / convert certain
> >>   columns to enum (for example, all string columns) during
> >>   the parsing process, and not afterward. This is very important
> >>   for enabling fast groupby on large datasets and reducing
> >>   unnecessary memory usage up front (imagine a column with a
> >>   million values, with only 10 unique values occurring). This
> >>   would be trivial to implement with a C hash table library
> >>   like khash.h (third sketch below).
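
Here is a minimal first sketch for reproducing the memory
behavior, assuming a Unix platform (the resource module) and
synthetic data; scale nrows up to 1,000,000 to see the full
effect:

    import io
    import resource

    import numpy as np
    import pandas as pd

    def peak_rss_mb():
        # Peak resident set size; ru_maxrss is kilobytes on Linux
        # but bytes on macOS.
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

    # Synthetic in-memory CSV: 200,000 rows x 20 float columns.
    nrows, ncols = 200_000, 20
    text = (",".join(["3.141592653589793"] * ncols) + "\n") * nrows

    print("baseline peak RSS: %.0f MB" % peak_rss_mb())
    df = pd.read_csv(io.StringIO(text), header=None)
    print("after read_csv:    %.0f MB" % peak_rss_mb())
    arr = np.genfromtxt(io.StringIO(text), delimiter=",")
    print("after genfromtxt:  %.0f MB" % peak_rss_mb())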
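
A second sketch times a regex tokenizer against the C csv module
on identical synthetic input; the regex is a stand-in for
illustration, and note that both paths still box every field as a
Python object:

    import csv
    import io
    import re
    import timeit

    # 100,000 rows x 20 numeric fields of synthetic CSV text.
    row = ",".join(["3.14159"] * 20) + "\n"
    text = row * 100_000
    field = re.compile(r"[^,\n]+")

    def parse_regex():
        return [field.findall(line) for line in io.StringIO(text)]

    def parse_csv():
        return list(csv.reader(io.StringIO(text)))

    print("regex tokenizer: %.2fs" % timeit.timeit(parse_regex, number=3))
    print("csv module:      %.2fs" % timeit.timeit(parse_csv, number=3))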
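
And a third sketch of the factorization idea in pure Python, with
a dict standing in for the khash.h table a C parser would use (the
factorize function here is hypothetical, not an existing API): a
million-row string column with three distinct values becomes a
million int32 codes plus a tiny uniques array, instead of a
million PyObject pointers.

    import numpy as np

    def factorize(values):
        # One hash lookup per token, as a C parser would do with
        # khash.h: store a small integer code per row instead of a
        # string object per row.
        codes = np.empty(len(values), dtype=np.int32)
        seen = {}
        for i, v in enumerate(values):
            codes[i] = seen.setdefault(v, len(seen))
        uniques = np.array(sorted(seen, key=seen.get), dtype=object)
        return codes, uniques

    column = ["red", "blue", "red", "green", "blue"] * 200_000
    codes, uniques = factorize(column)
    print(len(codes), codes.dtype, uniques)
    # -> 1000000 int32 ['red' 'blue' 'green']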
> >>
> >> To be clear: I'm going to do this eventually whether or not it
> >> happens in NumPy because it's an existing problem for heavy
> >> pandas users. I see no reason why the code can't emit structured
> >> arrays, too, so we might as well have a common library component
> >> that I can use in pandas and specialize to the DataFrame internal
> >> structure.
> >>
> >> It seems clear to me that this work needs to be done at the
> >> lowest level possible, probably all in C (or C++?) or maybe
> >> Cython plus C utilities.
> >>
> >> If anyone wants to get involved in this particular problem right
> >> now, let me know!
> >>
> >> best,
> >> Wes
> > --
> > Erin Scott Sheldon
> > Brookhaven National Laboratory
-- 
Erin Scott Sheldon
Brookhaven National Laboratory

