[Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

Paul Natsuo Kishimoto mail@paul.kishimoto.n...
Mon Jul 16 07:13:54 CDT 2012


Hi Pierre,

On Mon, 2012-07-16 at 01:54 -0500, Travis Oliphant wrote:
> On Jul 16, 2012, at 1:52 AM, Pierre GM wrote:
> 
> > Hello,
> > I'm siding w/ Tom, Nathaniel and Travis. I don't think the change as
> > it is is advisable. It's a regression, and breaking=bad.
> > Now, I can understand your frustration, so, what about a trade-off?
> > The first line w/ a comment after the first 'skip_header' ones
> > should be parsed for column titles (and we call it
> > 'first_commented_line'). We split it along the comment character,
> > say, #. If there's some non-space character before the #, we keep
> > this part of 'first_commented_line' as titles: that should work for
> > your case. If the first non-space character was #, then what comes
> > after are the titles (that's Tom's case and the current default).
> > I'm not looking forward to introducing yet another keyword,
> > genfromtxt is enough of a mess as it is (unless we add a
> > 'need_coffee' one).
> > What y'all think?

> That seems like an acceptable proposal --- it is consistent with
> current behavior but also satisfies the use-case (without another
> keyword which is a bonus). 

> So, 

> +1 from me.

> -Travis
> 
Thanks for jumping in, and for offering a compromise solution. I agree
that genfromtxt() has too many kwargs—it took me several minutes of
reading the docs to realize why it wasn't behaving as expected!

To be ultra clear (since I want to code this), you are suggesting that
'first_commented_line' be a *new* accepted value for the kwarg 'names',
to invoke the behaviour you suggest?

---

If this IS what you mean, I'd counter-propose something in the same
spirit, but a bit simpler…we let the kwarg 'skip_header' take some
additional value, say int(0), int(-1), str('auto'), or True. In this
case, instead of skipping a fixed number of lines, it will skip any
number of consecutive empty OR commented lines; THEN apply the behaviour
you describe.

The semantics of this are more intuitive, because this is what I am
really after: to *skip* a commented *header* of arbitrary length. So my
four examples below could be parsed with:

     1. genfromtxt(..., names=True)
     2. genfromtxt(..., names=True, skip_header=True)
     3. genfromtxt(..., names=True)
     4. genfromtxt(..., names=True, skip_header=True)

…crucially #1 avoids the regression.

Does this seem good to everyone?
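
A rough pure-Python sketch of what I have in mind follows. Nothing in it
is existing NumPy behaviour: find_names_line() and its auto_skip flag are
hypothetical stand-ins for genfromtxt(..., names=True, skip_header=True),
and only whitespace-delimited files are handled.

```python
def find_names_line(lines, comments="#", auto_skip=False):
    """Return column names from a list of lines under the proposed semantics.

    Hypothetical sketch: `auto_skip` stands in for the proposed
    skip_header=True and is not an existing genfromtxt() option.
    """
    lines = list(lines)
    if auto_skip:
        # Skip any number of consecutive empty or fully commented lines.
        while lines:
            stripped = lines[0].strip()
            if stripped and not stripped.startswith(comments):
                break
            lines.pop(0)
    if not lines:
        return []
    # Then apply the rule from Pierre's message to the first remaining line:
    # text before the comment character wins; otherwise use the text after it.
    before, _, after = lines[0].partition(comments)
    return before.split() if before.strip() else after.split()


example_1 = ["# gender age weight", "M   21 72.100000"]
example_2 = ["# here is a general file comment",
             "# it is spread over multiple lines",
             "gender age weight", "M   21 72.100000"]
example_3 = ["gender age weight # wrong column names", "M   21  72.100000"]
example_4 = ["# here is a general file comment",
             "# the columns in this table are",
             "gender age weight # here is a comment on the header line",
             "# following this line are the data", "M   21  72.100000"]

for lines, auto in [(example_1, False), (example_2, True),
                    (example_3, False), (example_4, True)]:
    print(find_names_line(lines, auto_skip=auto))
```

All four calls give 'gender', 'age', 'weight'. Passing auto_skip=True for
Example 1 would instead skip the commented header and take the names from
the first data line, which is why the skipping has to be opt-in.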

---

But if this is NOT what you mean, then what you say does not actually
work with the simple use-case of my Example #2 below. The first
commented line is "# here is a..." with # as the first non-space
character, so the part after becomes the names 'here', 'is', 'a' etc.

In short, the code can't resolve the ambiguity without some extra
information from the user.
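
To make the ambiguity concrete, here is a minimal pure-Python sketch of
the splitting rule as I understand it from Pierre's description. The
function titles_from_first_commented_line() is hypothetical, not part of
NumPy, and it ignores skip_header, non-whitespace delimiters and
everything else genfromtxt() does.

```python
def titles_from_first_commented_line(line, comments="#"):
    """Apply the proposed rule to one line containing a comment character.

    Hypothetical helper for illustration only; not NumPy code.
    """
    before, _, after = line.partition(comments)
    if before.strip():
        # Non-space text before the comment character: that text holds the titles.
        return before.split()
    # The comment character came first: the titles are whatever follows it.
    return after.split()


example_1 = "# gender age weight"
example_2 = "# here is a general file comment"
example_3 = "gender age weight # wrong column names"

for line in (example_1, example_2, example_3):
    print(titles_from_first_commented_line(line))
```

Examples 1 and 3 come back as 'gender', 'age', 'weight', but Example 2
starts with '#' just like Example 1, so the rule cannot tell them apart
and returns 'here', 'is', 'a', 'general', 'file', 'comment'.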
> 
> > On Jul 13, 2012 7:29 PM, "Paul Natsuo Kishimoto"
> > <mail@paul.kishimoto.name> wrote:
> >         On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote:
> >         > On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto
> >         > <mail@paul.kishimoto.name> wrote:
> >         > > Hello everyone,
> >         > >
> >         > >         I am a longtime NumPy user, and I just filed my first contribution to
> >         > > the code as a pull request to fix what I felt was a bug in the behaviour
> >         > > of genfromtxt() https://github.com/numpy/numpy/pull/351
> >         > > It turns out this alters existing behaviour that some people may depend
> >         > > on, so I was encouraged to raise the issue on this list to see what the
> >         > > consensus was.
> >         > >
> >         > > This behaviour happens in the specific situation where:
> >         > >       * Comments are used in the file (the default comment character is
> >         > >         '#', which I'll use here), AND
> >         > >       * The kwarg names=True is given. In this case, genfromtxt() is
> >         > >         supposed to read an initial row containing the names of the
> >         > >         columns and return an array with a structured dtype.
> >         > >
> >         > > Currently, these options work with a file like (Example #1):
> >         > >
> >         > >         # gender age weight
> >         > >         M   21 72.100000
> >         > >         F   35  58.330000
> >         > >         M   33  21.99
> >         > >
> >         > > …but NOT with a file like (Example #2):
> >         > >
> >         > >         # here is a general file comment
> >         > >         # it is spread over multiple lines
> >         > >         gender age weight
> >         > >         M   21 72.100000
> >         > >         F   35  58.330000
> >         > >         M   33  21.99
> >         > >
> >         > > …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and
> >         > > thinks all of the columns are strings because 'gender', 'age' and
> >         > > 'weight' are not numbers.
> >         > >
> >         > >         This is because genfromtxt() (after skipping a number of lines as
> >         > > specified in the optional kwarg skip_header) will use the *first* line
> >         > > it encounters to produce column names. If that line contains a comment
> >         > > character, genfromtxt() discards everything *up to and including* the
> >         > > comment character, and tries to use the content *after* the comment
> >         > > character as headers (Example 3):
> >         > >
> >         > >         gender age weight # wrong column names
> >         > >         M   21  72.100000
> >         > >         F   35  58.330000
> >         > >         M   33  21.99
> >         > >
> >         > > …the resulting column names are 'wrong', 'column' and 'names'.
> >         > >
> >         > > My proposed change was that, if the first (or any subsequent) line
> >         > > contains a comment character, it should be treated as an *actual
> >         > > comment*, and discarded along with anything that follows it on the line.
> >         > >
> >         > >         In Example 2, the result would be that the first two lines appear empty
> >         > > (no text before '#'), and the third line ("gender age weight") is used
> >         > > for column names.
> >         > >
> >         > >         In Example 3, the result would be that "gender age weight" is used for
> >         > > column names while "# wrong column names" is ignored.
> >         > >
> >         > > BUT!
> >         > >
> >         > >         In Example 1, the result would be that the first line appears empty,
> >         > > and "M   21  72.100000" are used for column names.
> >         > >
> >         > > In other words, this change would do away with the previous behaviour
> >         > > where the very first commented line was (magically?) treated not as a
> >         > > comment but instead as column headers. This might break some existing
> >         > > code. On the positive side, it would allow the user to be more liberal
> >         > > with the format of input files (Example 4):
> >         > >
> >         > >         # here is a general file comment
> >         > >         # the columns in this table are
> >         > >         gender age weight # here is a comment on the header line
> >         > >         # following this line are the data
> >         > >         M   21  72.100000
> >         > >         F   35  58.330000 # here is a comment on a data line
> >         > >         M   33  21.99
> >         > >
> >         > > I feel that this is a better/more flexible behaviour for genfromtxt(),
> >         > > but, as stated, I am interested in your thoughts.
> >         > >
> >         > > Cheers,
> >         > > --
> >         > > Paul Natsuo Kishimoto
> >         > >
> >         > > SM candidate, Technology & Policy Program (2012)
> >         > > Research assistant,  http://globalchange.mit.edu
> >         > > https://paul.kishimoto.name      +1 617 302 6105
> >         > >
> >         > >
> >         >
> >         > Hi Paul,
> >         >
> >         > At least in astronomy tabular files with the column definitions in the
> >         > first commented line are reasonably common.  This is driven in part by
> >         > wide use of legacy packages like supermongo etc that don't have
> >         > intelligent table readers, so users document the column names as a
> >         > comment line.  I think making this break might be unfortunate for
> >         > users in astronomy.
> >         >
> >         > Dealing with commented header definitions is annoying.  Not that it
> >         > matters specifically for your genfromtxt() proposal, but in the
> >         > asciitable reader this case is handled with a particular reader class
> >         > that expects the first comment line to contain the column definitions:
> >         >
> >         > http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader
> >         >
> >         > Cheers,
> >         > Tom
> >         
> >         Tom,
> >         
> >         Thanks for this information. In thinking about how people would work
> >         around this, I figured it would be fairly easy to discard a comment
> >         character that occurred as the very first character in a file, e.g.:
> >         
> >                 from io import StringIO
> >                 import numpy
> >                 raw = StringIO(open('example.txt').read()[1:])
> >                 data = numpy.genfromtxt(raw, comments='#', names=True)
> >         
> >         …but I realize that making this change in many places would still be an
> >         annoyance.
> >         
> >                 I should perhaps also add that my view of 'proper' table formats is
> >         partly influenced by another plotting package, namely pgfplots for LaTeX
> >         (http://pgfplots.sourceforge.net/ ,
> >         http://pgfplots.sourceforge.net/gallery.html) which uses uncommented
> >         headers. To the extent NumPy users are also LaTeX users, similar
> >         semantics could be more friendly.
> >         
> >         Looking forward to more input from other users,
> >         --
> >         Paul Natsuo Kishimoto
> >         
> >         SM candidate, Technology & Policy Program (2012)
> >         Research assistant,  http://globalchange.mit.edu
> >         https://paul.kishimoto.name      +1 617 302 6105

-- 
Paul Natsuo Kishimoto

SM candidate, Technology & Policy Program (2012)
Research assistant,  http://globalchange.mit.edu
http://paul.kishimoto.name       +1 617 302 6105