[Numpy-discussion] Proposed change in genfromtxt(..., comments='#', names=True) behaviour

Paul Natsuo Kishimoto mail@paul.kishimoto.n...
Fri Jul 13 10:15:11 CDT 2012


Hello everyone,

	I am a longtime NumPy user, and I just filed my first contribution to
the code as a pull request to fix what I felt was a bug in the behaviour
of genfromtxt(): https://github.com/numpy/numpy/pull/351
It turns out this alters existing behaviour that some people may depend
on, so I was encouraged to raise the issue on this list to see what the
consensus was.

This behaviour happens in the specific situation where:
      * Comments are used in the file (the default comment character is
        '#', which I'll use here), AND
      * The kwarg names=True is given. In this case, genfromtxt() is
        supposed to read an initial row containing the names of the
        columns and return an array with a structured dtype.

Currently, these options work with a file like (Example 1):

        # gender age weight
        M   21  72.100000
        F   35  58.330000
        M   33  21.99

…but NOT with a file like (Example 2):

        # here is a general file comment
        # it is spread over multiple lines
        gender age weight
        M   21  72.100000
        F   35  58.330000
        M   33  21.99

…genfromtxt() believes the column names are 'here', 'is', 'a', etc., and
thinks all of the columns are strings because 'gender', 'age' and
'weight' are not numbers.

	This is because genfromtxt() (after skipping a number of lines as
specified in the optional kwarg skip_header) will use the *first* line
it encounters to produce column names. If that line contains a comment
character, genfromtxt() discards everything *up to and including* the
comment character, and tries to use the content *after* the comment
character as headers (Example 3):

        gender age weight # wrong column names
        M   21  72.100000
        F   35  58.330000
        M   33  21.99

…the resulting column names are 'wrong', 'column' and 'names'.
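
To make the rule concrete, here is a minimal sketch in plain Python. It is
only a simplified model of the behaviour described above, not genfromtxt()'s
actual code: the function name current_header_names is hypothetical, and the
model ignores blank lines, delimiters and type detection.

```python
def current_header_names(lines, comments="#", skip_header=0):
    """Simplified model of the header rule described above (not NumPy's code)."""
    # The first line left after skipping `skip_header` lines supplies the names.
    first = lines[skip_header]
    if comments in first:
        # Discard everything up to and including the comment character.
        first = first.split(comments, 1)[1]
    return first.split()


example_1 = ["# gender age weight", "M   21  72.100000"]
example_2 = [
    "# here is a general file comment",
    "# it is spread over multiple lines",
    "gender age weight",
    "M   21  72.100000",
]
example_3 = ["gender age weight # wrong column names", "M   21  72.100000"]

print(current_header_names(example_1))  # ['gender', 'age', 'weight']
print(current_header_names(example_2))  # ['here', 'is', 'a', 'general', 'file', 'comment']
print(current_header_names(example_3))  # ['wrong', 'column', 'names']

# Skipping the two comment lines by hand recovers the intended names.
print(current_header_names(example_2, skip_header=2))  # ['gender', 'age', 'weight']
```

In this model the only way to get the right names out of Example 2 is to count
the leading comment lines and pass that count as skip_header.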

My proposed change was that, if the first (or any subsequent) line
contains a comment character, it should be treated as an *actual
comment*, and discarded along with anything that follows it on the line.

	In Example 2, the result would be that the first two lines appear empty
(no text before '#'), and the third line ("gender age weight") is used
for column names.

	In Example 3, the result would be that "gender age weight" is used for
column names while "# wrong column names" is ignored.

BUT!

	In Example 1, the result would be that the first line appears empty,
and "M   21  72.100000" is used for column names.
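
The proposed rule can be sketched the same way. Again this is an
illustration under stated assumptions, not the code in the pull request: the
helper names strip_comment and proposed_header_names are hypothetical, and
fields are assumed to be whitespace-separated.

```python
def strip_comment(line, comments="#"):
    """Return the part of a line before the comment character, trimmed."""
    return line.split(comments, 1)[0].strip()


def proposed_header_names(lines, comments="#"):
    """Take names from the first line that is non-empty once comments are removed."""
    for line in lines:
        content = strip_comment(line, comments)
        if content:
            return content.split()
    return []


example_1 = ["# gender age weight", "M   21  72.100000"]
example_2 = [
    "# here is a general file comment",
    "# it is spread over multiple lines",
    "gender age weight",
    "M   21  72.100000",
]
example_3 = ["gender age weight # wrong column names", "M   21  72.100000"]

print(proposed_header_names(example_1))  # ['M', '21', '72.100000']
print(proposed_header_names(example_2))  # ['gender', 'age', 'weight']
print(proposed_header_names(example_3))  # ['gender', 'age', 'weight']
```

The first result shows the compatibility cost: a file whose header is itself
commented out loses its names under this rule.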

In other words, this change would do away with the previous behaviour
where the very first commented line was (magically?) treated not as a
comment but instead as column headers. This might break some existing
code. On the positive side, it would allow the user to be more liberal
with the format of input files (Example 4):

        # here is a general file comment
        # the columns in this table are
        gender age weight # here is a comment on the header line
        # following this line are the data
        M   21  72.100000
        F   35  58.330000 # here is a comment on a data line
        M   33  21.99
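
As a rough check that Example 4 is unambiguous under the proposed rule, the
sketch below parses it in plain Python. The function parse_table is
hypothetical and stands in for genfromtxt() only loosely: it assumes
whitespace-separated fields and leaves every value as a string, with no type
detection.

```python
def parse_table(text, comments="#"):
    """Parse a commented text table; every '#' starts a real comment."""
    names, rows = None, []
    for line in text.splitlines():
        # Drop the comment character and everything after it.
        content = line.split(comments, 1)[0].strip()
        if not content:
            continue  # the line was empty or held only a comment
        fields = content.split()
        if names is None:
            names = fields  # first line with content gives the column names
        else:
            rows.append(dict(zip(names, fields)))
    return names, rows


example_4 = """\
# here is a general file comment
# the columns in this table are
gender age weight # here is a comment on the header line
# following this line are the data
M   21  72.100000
F   35  58.330000 # here is a comment on a data line
M   33  21.99
"""

names, rows = parse_table(example_4)
print(names)      # ['gender', 'age', 'weight']
print(len(rows))  # 3
print(rows[1])    # {'gender': 'F', 'age': '35', 'weight': '58.330000'}
```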

I feel that this is a better and more flexible behaviour for genfromtxt(),
but, as stated, I am interested in your thoughts.

Cheers,
-- 
Paul Natsuo Kishimoto

SM candidate, Technology & Policy Program (2012)
Research assistant,  http://globalchange.mit.edu
https://paul.kishimoto.name      +1 617 302 6105