[Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

Tom Aldcroft aldcroft@head.cfa.harvard....
Mon Jul 16 15:00:32 CDT 2012


On Mon, Jul 16, 2012 at 3:06 PM, Paul Natsuo Kishimoto
<mail@paul.kishimoto.name> wrote:
> I've implemented this feature with skip_header=-1 as suggested by
> Pierre, and in doing so removed the regression. TravisBot seems to like
> it: https://github.com/numpy/numpy/pull/351
>
> On Mon, 2012-07-16 at 16:12 +0200, Pierre GM wrote:
>>         To be ultra clear (since I want to code this), you are
>>         suggesting that 'first_commented_line' be a *new* accepted
>>         value for the kwarg 'names', to invoke the behaviour you
>>         suggest?
>>
>>
>>
>> Nope, I was just referring to some hypothetical variable name. I meant
>> that:
>>
>> first_values = None
>> try:
>>     while not first_values:
>>         first_line = next(fhd)
>>         if names is True:
>>             parsed = [m for m in first_line.split(comments) if m.strip()]
>>             if parsed:
>>                 first_values = split_line(parsed[0])
>>         else:
>>             ...
>>
>> (it's not tested, I'm writing it as it comes. And I didn't even use
>> the `first_commented_line` name, sorry)
>>
>>
>>         If this IS what you mean, I'd counter-propose something in the
>>         same spirit, but a bit simpler…we let the kwarg 'skip_header'
>>         take some additional value, say int(0), int(-1), str('auto'),
>>         or True.
>>
>>
>>
>>
>>         In this case, instead of skipping a fixed number of lines, it
>>         will skip any number of consecutive empty OR commented lines;
>>
>>
>>
>>
>> I really like the idea of having `skip_header=-1` skip all the empty
>> or commented lines (that is, lines whose first non-space character is
>> the `comments` character). That'd be rather convenient.
>>
>>
>>
>>
>>         The semantics of this are more intuitive, because this is what
>>         I am really after: to *skip* a commented *header* of arbitrary
>>         length. So my four examples below could be parsed with:
>>
>>         1. genfromtxt(..., names=True)
>>         2. genfromtxt(..., names=True, skip_header=True)
>>         3. genfromtxt(..., names=True)
>>         4. genfromtxt(..., names=True, skip_header=True)
>>
>>         …crucially #1 avoids the regression.
>>
>>
>>         Does this seem good to everyone?
>>
>>
>>
>>
>> Sounds good w/ `skip_header=-1`
>>
>>
>>         But if this is NOT what you mean, then what you say does not
>>         actually work with the simple use-case of my Example #2 below.
>>         The first commented line is "# here is a..." with # as the
>>         first non-space character, so the part after becomes the names
>>         'here', 'is', 'a' etc.
>>
>>
>>
>>
>> In that case, you could always use `skip_header=2`
>>
>>         In short, the code can't resolve the ambiguity without some
>>         extra information from the user.
>>
>>
>>
>>
>> It's always best not to let the code guess too much anyway...
>>
>> Well, no regression, and you have a nice plan. I'm for it.
>> Anybody else?
>>
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
> --
> Paul Natsuo Kishimoto
>
> SM candidate, Technology & Policy Program (2012)
> Research assistant,  http://globalchange.mit.edu
> https://paul.kishimoto.name      +1 617 302 6105
>

I think that the proposed solution is OK, but it does make it even
trickier for the average user to predict the behavior of genfromtxt()
for different situations.  Perhaps as part of this pull request Paul
should also update the documentation to include a section describing
this behavior and usage with examples 1 to 4.
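
As a rough illustration of the skipping rule discussed above: the sketch
below counts the leading lines that are empty or whose first non-space
character is the comment marker. It is not the code in the pull request
and not part of numpy; the helper name count_header_lines and the sample
lines are made up for illustration, and the sample is not one of the
four examples from this thread.

```python
def count_header_lines(lines, comments="#"):
    """Count leading lines that are empty or commented.

    Hypothetical helper, not a numpy function. A line counts as
    commented when its first non-space text is the `comments` marker.
    """
    count = 0
    for line in lines:
        stripped = line.strip()
        # Stop at the first line that holds real content.
        if stripped and not stripped.startswith(comments):
            break
        count += 1
    return count


# Made-up input: a commented header of arbitrary length, a blank line,
# then an uncommented line of names followed by data.
sample = [
    "# here is a general comment\n",
    "   # it spans two lines\n",
    "\n",
    "a b c\n",
    "1 2 3\n",
]

n_skip = count_header_lines(sample)
names = sample[n_skip].split()
print(n_skip, names)
```

With a count like this in hand, the remaining lines could be passed on
with an explicit, fixed skip_header value, so nothing is left for the
reader to guess.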

- Tom

