[Numpy-discussion] Possible roadmap addendum: building better text file readers
Thu Feb 23 14:33:54 CST 2012
Excerpts from Wes McKinney's message of Thu Feb 23 15:24:44 -0500 2012:
> On Thu, Feb 23, 2012 at 3:23 PM, Erin Sheldon <firstname.lastname@example.org> wrote:
> > I designed the recfile package to fill this need. It might be a start.
> Can you relicense as BSD-compatible?
If required, that would be fine with me.
> > Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:
> >> dear all,
> >> I haven't read all 180 e-mails, but I didn't see this on Travis's
> >> initial list.
> >> All of the existing flat file reading solutions I have seen are
> >> not suitable for many applications, and they compare very unfavorably
> >> to tools present in other languages, like R. Here are some of the
> >> main issues I see:
> >> - Memory usage: creating millions of Python objects when reading
> >> a large file results in horrendously bad memory utilization,
> >> which the Python interpreter is loath to return to the
> >> operating system. Any solution using the CSV module (like
> >> pandas's parsers, which are a lot faster than anything else I
> >> know of in Python) suffers from this problem because the data
> >> come out boxed in tuples of PyObjects. Try loading a 1,000,000
> >> x 20 CSV file into a structured array using np.genfromtxt or
> >> into a DataFrame using pandas.read_csv and you will immediately
> >> see the problem. R, by contrast, uses very little memory.
> >> - Performance: post-processing of Python objects results in poor
> >> performance. Also, for the actual parsing, anything regular
> >> expression based (like the loadtable effort over the summer,
> >> all apologies to those who worked on it) is doomed to
> >> failure. I think having a tool with a high degree of
> >> compatibility and intelligence for parsing unruly small files
> >> does make sense, but it's not appropriate for large,
> >> well-behaved files.
> >> - Need to "factorize": as soon as there is an enum dtype in
> >> NumPy, we will want to enable the file parsers for structured
> >> arrays and DataFrame to be able to "factorize" / convert to
> >> enum certain columns (for example, all string columns) during
> >> the parsing process, and not afterward. This is very important
> >> for enabling fast groupby on large datasets and reducing
> >> unnecessary memory usage up front (imagine a column with a
> >> million values, with only 10 unique values occurring). This
> >> would be trivial to implement using a C hash table
> >> implementation like khash.h
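To make the "factorize" point in the list above concrete, here is a minimal pure-Python sketch of the idea, not anyone's actual implementation. The function name factorize and the sample column are made up for illustration, and a Python dict stands in for the C hash table (such as khash.h) that a real parser would use. Each distinct string is stored once, and the column itself becomes a compact buffer of integer codes.

```python
from array import array


def factorize(values):
    """Map each distinct value to a small integer code.

    Codes are assigned in order of first appearance. Returns
    (codes, levels), where levels[code] recovers the original value.
    """
    codes = array("i")   # compact C int buffer, one code per input value
    index = {}           # value -> code (stand-in for a C hash table)
    levels = []          # code -> value
    for value in values:
        code = index.get(value)
        if code is None:
            # First time this value is seen: assign the next free code.
            code = len(levels)
            index[value] = code
            levels.append(value)
        codes.append(code)
    return codes, levels


# Hypothetical column with few unique values, as in the example above.
column = ["NY", "CA", "NY", "TX", "CA", "NY"]
codes, levels = factorize(column)
print(list(codes))  # [0, 1, 0, 2, 1, 0]
print(levels)       # ['NY', 'CA', 'TX']
```

Doing this while parsing, rather than afterward, means the million repeated strings are never all held as separate Python objects; only the codes and the handful of unique levels are kept.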
> >> To be clear: I'm going to do this eventually whether or not it
> >> happens in NumPy because it's an existing problem for heavy
> >> pandas users. I see no reason why the code can't emit structured
> >> arrays, too, so we might as well have a common library component
> >> that I can use in pandas and specialize to the DataFrame internal
> >> structure.
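As a rough illustration of the storage side of this (typed columns instead of boxed Python objects), here is a small stdlib-only sketch. The function read_columns, its typecodes argument, and the sample data are all hypothetical. It still goes through the csv module one row at a time, so it says nothing about parsing speed; it only shows values landing directly in typed per-column buffers, which is roughly the layout a shared component could hand to a structured array or a DataFrame.

```python
import csv
import io
from array import array

# Converter for each supported array typecode (an assumption of this sketch).
CONVERTERS = {"q": int, "d": float}


def read_columns(fileobj, typecodes):
    """Stream a headerless CSV into one typed buffer per column.

    typecodes is a string with one array typecode per column,
    e.g. "qd" for an int64 column followed by a float64 column.
    """
    columns = [array(tc) for tc in typecodes]
    converters = [CONVERTERS[tc] for tc in typecodes]
    for row in csv.reader(fileobj):
        if len(row) != len(columns):
            raise ValueError("expected %d fields, got %d" % (len(columns), len(row)))
        # Only this one row exists as Python strings at any time;
        # the converted values are stored unboxed in the column buffers.
        for col, conv, field in zip(columns, converters, row):
            col.append(conv(field))
    return columns


data = "1,2.5\n2,3.5\n3,4.0\n"
ids, values = read_columns(io.StringIO(data), "qd")
print(list(ids))     # [1, 2, 3]
print(list(values))  # [2.5, 3.5, 4.0]
```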
> >> It seems clear to me that this work needs to be done at the
> >> lowest level possible, probably all in C (or C++?) or maybe
> >> Cython plus C utilities.
> >> If anyone wants to get involved in this particular problem right
> >> now, let me know!
> >> best,
> >> Wes
> > --
> > Erin Scott Sheldon
> > Brookhaven National Laboratory
Erin Scott Sheldon
Brookhaven National Laboratory