[Numpy-discussion] Possible roadmap addendum: building better text file readers

Warren Weckesser warren.weckesser@enthought....
Sun Feb 26 11:23:50 CST 2012


On Thu, Feb 23, 2012 at 2:19 PM, Warren Weckesser <
warren.weckesser@enthought.com> wrote:

>
> On Thu, Feb 23, 2012 at 2:08 PM, Travis Oliphant <travis@continuum.io> wrote:
>
>> This is actually on my short-list as well --- it just didn't make it to
>> the list.
>>
>> In fact, we have someone starting work on it this week.  It is his first
>> project so it will take him a little time to get up to speed on it, but he
>> will contact Wes and work with him and report progress to this list.
>>
>> Integration with np.loadtxt is a high priority.  I think loadtxt is now
>> the 3rd or 4th "text-reading" interface I've seen in NumPy.  I have no
>> interest in making a new one if we can avoid it.   But, we do need to make
>> it faster with less memory overhead for simple cases like Wes describes.
>>
>> -Travis
>>
>>
>
> I have a "proof of concept" CSV reader written in C (with a Cython
> wrapper).  I'll put it on github this weekend.
>
> Warren
>
>

The text reader that I've been working on is now on github:
    https://github.com/WarrenWeckesser/textreader

Currently it makes two passes through the file.  The first pass just counts
the number of rows.  It then allocates the array and reads the file again
to parse the data and fill in the array.  Eventually the first pass will be
optional, and you'll be able to specify how many rows to read (and then
continue reading another block if you haven't read the entire file).
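
A minimal sketch of the two-pass idea in pure Python, for illustration only.
It is not the textreader code (which is C with a Cython wrapper); the function
name read_two_pass and its parameters are hypothetical, and it assumes a
rectangular file of floats with no quoted fields.

```python
import io
from array import array


def read_two_pass(stream, ncols, delimiter=","):
    """Read a rectangular, all-float text table in two passes (illustrative)."""
    # Pass 1: count the non-blank rows so storage can be allocated once.
    nrows = sum(1 for line in stream if line.strip())

    # Allocate a flat, row-major buffer of doubles up front.
    data = array("d", [0.0]) * (nrows * ncols)

    # Pass 2: rewind, parse each field, and fill the buffer in place.
    stream.seek(0)
    row = 0
    for line in stream:
        if not line.strip():
            continue
        fields = line.split(delimiter)
        if len(fields) != ncols:
            raise ValueError(
                f"row {row}: expected {ncols} fields, got {len(fields)}"
            )
        for col, text in enumerate(fields):
            data[row * ncols + col] = float(text)
        row += 1
    return nrows, data


nrows, data = read_two_pass(io.StringIO("1.0,2.0\n3.0,4.0\n5.0,6.0\n"), ncols=2)
print(nrows, list(data))
```

Counting first means the buffer never has to be grown and copied while
parsing, at the cost of reading the file twice.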

You currently have to give the dtype as a structured array.  It would be
nice to remove that restriction.  Actually, there are quite a few "must
have" features that it doesn't have yet.
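
For readers who have not used one, here is roughly what a structured dtype
looks like.  The textreader API is not shown in this message, so this sketch
uses np.loadtxt instead; the field names x, y and n are made up for the
example.

```python
import io

import numpy as np

# Hypothetical three-column layout: two floats and an integer per row.
dt = np.dtype([("x", np.float64), ("y", np.float64), ("n", np.int32)])

text = "1.0,2.0,3\n4.0,5.0,6\n"

# With a structured dtype, each row becomes one record with named fields.
arr = np.loadtxt(io.StringIO(text), delimiter=",", dtype=dt)

print(arr.shape)   # one record per row
print(arr["x"])    # a column is selected by field name
```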

One issue that this code handles is newlines embedded in quoted fields.
Excel can generate and read files like this:

1.0,2.0,"foo
bar"

That is one "row" with three fields.  The third field contains "foo\nbar".
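
To see why this case trips up line-oriented readers, the sketch below compares
a naive split on newlines with Python's standard csv module, which follows the
quoting rules.  It only illustrates the behavior; it is not how textreader
parses the file.

```python
import csv
import io

# The example above: one logical row whose third field contains a newline.
text = '1.0,2.0,"foo\nbar"\n'

# Naive approach: treat every physical line as a row.
naive_rows = [line.split(",") for line in text.splitlines()]

# Quote-aware approach: the csv module keeps the quoted newline in the field.
# newline="" stops the stream from translating line endings before csv sees them.
rows = list(csv.reader(io.StringIO(text, newline="")))

print(len(naive_rows))  # 2 physical lines, wrongly treated as 2 rows
print(rows)             # [['1.0', '2.0', 'foo\nbar']]
```

This also means that counting rows is not the same as counting newline
characters; the counting pass has to track whether it is inside a quoted
field.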

I haven't pushed it to the extreme, but the "big" example (in the examples/
directory) is a 1 gig text file with 2 million rows and 50 fields in each
row.  This is read in less than 30 seconds (but that's with a solid state
drive).
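
As a back-of-the-envelope check on those numbers (taking "1 gig" as 10**9
bytes, which is an assumption, and 30 seconds as an upper bound on the time):

```python
size_bytes = 1e9        # assumption: "1 gig" taken as 10**9 bytes
rows = 2_000_000
fields_per_row = 50
seconds = 30            # reported as "less than 30 seconds"

bytes_per_row = size_bytes / rows                   # average row length
bytes_per_field = bytes_per_row / fields_per_row    # includes the delimiter
min_mb_per_s = size_bytes / seconds / 1e6           # lower bound on throughput
min_rows_per_s = rows / seconds                     # lower bound on row rate

print(bytes_per_row, bytes_per_field)
print(round(min_mb_per_s, 1), round(min_rows_per_s))
```

So each field averages about 10 bytes, and the reader sustains at least
roughly 33 MB/s, or about 67,000 rows per second, across both passes.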

Quoting the README file: "This is experimental, unreleased software.  Use
at your own risk."  There are some hard-coded buffer sizes (that eventually
should be dynamic), and the error checking is not complete, so mistakes or
unanticipated cases can result in segmentation faults.

Warren



>
>
>>
>> On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote:
>>
>> > Hi,
>> >
>> > 23.02.2012 20:32, Wes McKinney wrote:
>> > [clip]
>> >> To be clear: I'm going to do this eventually whether or not it
>> >> happens in NumPy because it's an existing problem for heavy
>> >> pandas users. I see no reason why the code can't emit structured
>> >> arrays, too, so we might as well have a common library component
>> >> that I can use in pandas and specialize to the DataFrame internal
>> >> structure.
>> >
>> > If you do this, one useful aim could be to design the code such that it
>> > can be used in loadtxt, at least as a fast path for common cases. I'd
>> > really like to avoid increasing the number of APIs for text file
>> > loading.
>> >
>> > --
>> > Pauli Virtanen
>> >
>> > _______________________________________________
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion@scipy.org
>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>
>