[Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

Tom Aldcroft aldcroft@head.cfa.harvard....
Fri Jul 13 11:13:38 CDT 2012


On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto
<mail@paul.kishimoto.name> wrote:
> Hello everyone,
>
>         I am a longtime NumPy user, and I just filed my first contribution to
> the code as a pull request to fix what I felt was a bug in the behaviour
> of genfromtxt() https://github.com/numpy/numpy/pull/351
> It turns out this alters existing behaviour that some people may depend
> on, so I was encouraged to raise the issue on this list to see what the
> consensus was.
>
> This behaviour happens in the specific situation where:
>       * Comments are used in the file (the default comment character is
>         '#', which I'll use here), AND
>       * The kwarg names=True is given. In this case, genfromtxt() is
>         supposed to read an initial row containing the names of the
>         columns and return an array with a structured dtype.
>
> Currently, these options work with a file like (Example #1):
>
>         # gender age weight
>         M   21 72.100000
>         F   35  58.330000
>         M   33  21.99
>
> …but NOT with a file like (Example #2):
>
>         # here is a general file comment
>         # it is spread over multiple lines
>         gender age weight
>         M   21 72.100000
>         F   35  58.330000
>         M   33  21.99
>
> …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and
> thinks all of the columns are strings because 'gender', 'age' and
> 'weight' are not numbers.
>
>         This is because genfromtxt() (after skipping a number of lines as
> specified in the optional kwarg skip_header) will use the *first* line
> it encounters to produce column names. If that line contains a comment
> character, genfromtxt() discards everything *up to and including* the
> comment character, and tries to use the content *after* the comment
> character as headers (Example 3):
>
>         gender age weight # wrong column names
>         M   21  72.100000
>         F   35  58.330000
>         M   33  21.99
>
> …the resulting column names are 'wrong', 'column' and 'names'.
>
> My proposed change was that, if the first (or any subsequent) line
> contains a comment character, it should be treated as an *actual
> comment*, and discarded along with anything that follows it on the line.
>
>         In Example 2, the result would be that the first two lines appear empty
> (no text before '#'), and the third line ("gender age weight") is used
> for column names.
>
>         In Example 3, the result would be that "gender age weight" is used for
> column names while "# wrong column names" is ignored.
>
> BUT!
>
>         In Example 1, the result would be that the first line appears empty,
> and "M   21  72.100000" are used for column names.
>
> In other words, this change would do away with the previous behaviour
> where the very first commented line was (magically?) treated not as a
> comment but instead as column headers. This might break some existing
> code. On the positive side, it would allow the user to be more liberal
> with the format of input files (Example 4):
>
>         # here is a general file comment
>         # the columns in this table are
>         gender age weight # here is a comment on the header line
>         # following this line are the data
>         M   21  72.100000
>         F   35  58.330000 # here is a comment on a data line
>         M   33  21.99
>
> I feel that this is a better/more flexible behaviour for genfromtxt(),
> but—as stated—I am interested in your thoughts.
>
> Cheers,
> --
> Paul Natsuo Kishimoto
>
> SM candidate, Technology & Policy Program (2012)
> Research assistant,  http://globalchange.mit.edu
> https://paul.kishimoto.name      +1 617 302 6105
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

Hi Paul,

At least in astronomy, tabular files with the column definitions in the
first commented line are reasonably common.  This is driven in part by
the wide use of legacy packages like supermongo that don't have
intelligent table readers, so users document the column names in a
comment line.  Breaking this behaviour would be unfortunate for
users in astronomy.
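
To make the case concrete, here is a minimal sketch of the two file
layouts as genfromtxt() reads them today. It assumes a NumPy recent
enough to accept the encoding keyword (1.14 or later), which is not
part of the original discussion; the variable names are only for
illustration.

```python
import io

import numpy as np

# Example 1 from Paul's message: the column names sit on the first
# commented line, the layout that is common in astronomy files.
commented_header = io.StringIO(
    "# gender age weight\n"
    "M   21 72.100000\n"
    "F   35  58.330000\n"
    "M   33  21.99\n"
)

# encoding="utf-8" is an assumption (NumPy >= 1.14); it makes the string
# column come back as str rather than bytes.
data = np.genfromtxt(commented_header, names=True, dtype=None,
                     encoding="utf-8")
print(data.dtype.names)

# Example 2: general comment lines first, then an uncommented header.
# Skipping the two comment lines by hand via skip_header is the
# existing workaround for this layout.
with_preamble = io.StringIO(
    "# here is a general file comment\n"
    "# it is spread over multiple lines\n"
    "gender age weight\n"
    "M   21 72.100000\n"
    "F   35  58.330000\n"
    "M   33  21.99\n"
)
data2 = np.genfromtxt(with_preamble, names=True, dtype=None,
                      encoding="utf-8", skip_header=2)
print(data2.dtype.names)
```

Under the proposed change the first call would no longer pick up the
names from the commented line, which is the compatibility concern.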

Dealing with commented header definitions is annoying.  Not that it
matters specifically for your genfromtxt() proposal, but in the
asciitable reader this case is handled with a dedicated reader class
that expects the first comment line to contain the column definitions:

 http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader
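
The idea behind such a reader can be sketched in a few lines of plain
Python. This is only an illustration of the approach, not asciitable's
implementation; read_commented_header is a hypothetical helper name.

```python
def read_commented_header(lines, comment="#"):
    """Hypothetical helper: take column names from the first comment line.

    Later comment lines are skipped, and a trailing comment on a data
    line is dropped. Returns (names, rows); names is None if the input
    contains no comment line at all.
    """
    names = None
    rows = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        if stripped.startswith(comment):
            if names is None:
                # The first comment line defines the columns.
                names = stripped[len(comment):].split()
            continue  # any other comment line is a plain comment
        # Keep only the text before a trailing comment, then split fields.
        rows.append(stripped.split(comment, 1)[0].split())
    return names, rows


example_1 = [
    "# gender age weight",
    "M   21 72.100000",
    "F   35  58.330000",
    "M   33  21.99",
]
names, rows = read_commented_header(example_1)
print(names)
print(rows[0])
```

Making the commented-header case an explicit choice of reader, rather
than a side effect of comment handling, avoids the ambiguity.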

Cheers,
Tom

