[Numpy-discussion] Question about improving genfromtxt errors

Skipper Seabold jsseabold@gmail....
Wed Sep 30 12:44:21 CDT 2009

On Wed, Sep 30, 2009 at 12:56 PM, Bruce Southey <bsouthey@gmail.com> wrote:
> On 09/30/2009 10:22 AM, Skipper Seabold wrote:
>> On Tue, Sep 29, 2009 at 4:36 PM, Bruce Southey<bsouthey@gmail.com>  wrote:
>> <snip>
>>> Hi,
>>> The first case just has to handle a missing delimiter - actually I expect
>>> that most of my cases would relate to this. So here is a simple Python
>>> script to generate an arbitrarily large file with the occasional missing
>>> delimiter. I set it so it reads the desired number of rows and the
>>> frequency of bad rows from the Linux command line:
>>> $time python tbig.py 1000000 100000
>>> If I comment out the extra prints in io.py that I put in, it takes about 22
>>> seconds to finish when the delimiters are correct. With a missing
>>> delimiter it takes 20.5 seconds to crash.
>>> Bruce
>> I think this would actually cover most of the problems I was running
>> into.  The only other one I can think of is when I used a converter
>> that I thought would work, but it got unexpected data.  For example,
>> from StringIO import StringIO
>> import numpy as np
>> strip_rand = lambda x : float(('r' in x.lower() and x.split()[-1]) or
>>                               (not 'r' in x.lower() and x.strip() or 0.0))
>> # Example usage
>> strip_rand('R 40')
>> strip_rand('  ')
>> strip_rand('')
>> strip_rand('40')
>> strip_per = lambda x : float(('%' in x.lower() and x.split()[0]) or
>>                              (not '%' in x.lower() and x.strip() or 0.0))
>> # Example usage
>> strip_per('7 %')
>> strip_per('7')
>> strip_per(' ')
>> strip_per('')
>> # Unexpected usage
>> strip_per('R 1')
> Does this work for you?
> I get an:
> ValueError: invalid literal for float(): R 1

No, that's the idea.  Sorry this was a bit opaque.
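To make it concrete, here is a standalone rerun of the strip_per converter quoted above (unchanged, just with the example calls spelled out); the money code is exactly the value that raises:

```python
# strip_per as quoted above: take the number before a '%' sign,
# treat blank fields as 0.0, and let float() fail on anything else.
strip_per = lambda x: float(('%' in x.lower() and x.split()[0]) or
                            (not '%' in x.lower() and x.strip() or 0.0))

print(strip_per('7 %'))   # 7.0
print(strip_per('7'))     # 7.0
print(strip_per(' '))     # 0.0

# The unexpected money code is what blows up:
try:
    strip_per('R 1')
except ValueError as e:
    print('ValueError:', e)
```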

>> s = StringIO('D01N01,10/1/2003 ,1 %,R 75,400,600\r\nL24U05,12/5/2003\
>> ,2 %,1,300, 150.5\r\nD02N03,10/10/2004 ,R 1,,7,145.55')
> Can you provide the correct line before the bad line?
> It just makes it easy to understand why a line is bad.

The idea is that I have a column which I expect to be percentages,
but these are coded in by different data collectors, so some code a 0
for 0, some just leave it missing (which could just as well be 0), and
some use the %.  What I didn't expect was that some put in a money
amount, hence the 'R 1', which my converter doesn't catch.
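For what it's worth, a stricter converter along these lines would at least name the offending value in the error.  A sketch only (strip_per_strict is my name, not anything in numpy), keeping the same blank-means-0.0 convention as the lambdas above:

```python
def strip_per_strict(x):
    """Parse a percentage field like '7 %', '7', or blank (-> 0.0).

    Unlike the bare float() call in the lambda version, anything
    unrecognized -- such as the money code 'R 1' -- fails with the
    offending value spelled out in the message.
    """
    x = x.strip()
    if not x:
        return 0.0
    if '%' in x:
        x = x.split('%')[0].strip()
    try:
        return float(x)
    except ValueError:
        raise ValueError('unexpected value in percentage column: %r' % x)

print(strip_per_strict('7 %'))  # 7.0
try:
    strip_per_strict('R 1')
except ValueError as e:
    print(e)
```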

>> data = np.genfromtxt(s, converters = {2 : strip_per, 3 : strip_rand},
>> delimiter=",", dtype=None)
>> I don't have a clean install right now, but I think this returned a
>> "Converter is locked for upgrading" error.  I would just like to know
>> where the problem occurred (line and column, preferably not
>> zero-indexed), so I can go and have a look at my data.
> I have a rather limited understanding here. I think the problem is that
> Python is raising a ValueError because your strip_per() is wrong. It is
> not informative to you because _iotools.py is not aware that an invalid
> converter will raise a ValueError. Therefore there needs to be some way
> to test whether the converter is correct or not.

_iotools does catch this I believe, though I don't understand the
upgrading and locking properly.  The kludgy fix that I provided in the
first post ("I do not report the error from
_iotools.StringConverter...") catches that an error is raised from
_iotools and tells me exactly where the converter fails, so I can go
to, say, line 750,000, column 250 (and converter with key 249) instead
of knowing nothing except that one of my ~500 converters failed
somewhere in a 1 million line data file.  If you still want to keep
the error messages from _iotools.StringConverter, then maybe they
could have a (%s, %s) added, to be filled in by genfromtxt when it
knows (line, column), or something similar, as was suggested earlier
in this thread I believe.  Then again, this might not be possible.  I
haven't tried.
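In the meantime, a pre-pass over the raw lines can give that nudge without touching _iotools at all.  A rough sketch (the locate_bad_field helper is hypothetical, not part of numpy), reporting 1-based line and column numbers as suggested above:

```python
def locate_bad_field(lines, converters, delimiter=','):
    """Return (line, column, value) for the first field a converter
    rejects, 1-based as suggested above, or None if every row parses.
    `converters` maps 0-based column indices to callables, as in
    genfromtxt.
    """
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip('\r\n').split(delimiter)
        for col, conv in sorted(converters.items()):
            try:
                conv(fields[col])
            except (ValueError, IndexError):
                bad = fields[col] if col < len(fields) else '<missing>'
                return (line_no, col + 1, bad)
    return None

# The sample data from this thread: the money code sits in row 3, column 3.
strip_per = lambda x: float(('%' in x.lower() and x.split()[0]) or
                            (not '%' in x.lower() and x.strip() or 0.0))
rows = ['D01N01,10/1/2003 ,1 %,R 75,400,600',
        'L24U05,12/5/2003 ,2 %,1,300, 150.5',
        'D02N03,10/10/2004 ,R 1,,7,145.55']
print(locate_bad_field(rows, {2: strip_per}))  # (3, 3, 'R 1')
```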

> In this case I think it is the delimiter, so checking the column
> numbers should occur before the application of the converter to that row.

Sometimes it was the case that I had an extra comma in a number (say
1,000) and then the converter tried to work on the wrong column, and
sometimes it was because my converter didn't cover every use case,
because I didn't know it yet.  Either way, I just needed a gentle
nudge in the right direction.

If that doesn't clear up what I was after, I can try to provide a more
detailed code sample.

