[Numpy-discussion] loadtxt/savetxt tickets

Bruce Southey bsouthey@gmail....
Thu Mar 31 10:03:11 CDT 2011


On Wed, Mar 30, 2011 at 9:53 PM, Charles R Harris
<charlesr.harris@gmail.com> wrote:
>
>
> On Sun, Mar 27, 2011 at 4:09 AM, Paul Anton Letnes
> <paul.anton.letnes@gmail.com> wrote:
>>
>> On 26. mars 2011, at 21.44, Derek Homeier wrote:
>>
>> > Hi Paul,
>> >
>> > having had a look at the other tickets you dug up,
>> >
[snip]
>>
>> >> 1071:
>> >>      It is not clear to me whether loadtxt is supposed to support
>> >> missing values in the fashion indicated in the ticket.
>> >
>> > In principle it should at least allow you to, by the use of converters
>> > as described there.
>> > The problem is, the default delimiter is described as 'any whitespace',
>> > which in the present implementation obviously includes any number of
>> > blanks or tabs. These are therefore treated differently from delimiters
>> > like ',' or '&'. I'd reckon there are too many people actually relying
>> > on this behaviour to silently change it (e.g. I know plenty of tables
>> > with columns separated by either one or several tabs depending on the
>> > length of the previous entry). But the tab is apparently also treated
>> > differently if explicitly specified with "delimiter='\t'" - and in that
>> > case using a converter à la {2: lambda s: float(s or 'Nan')} is working
>> > for fields in the middle of the line, but not at the end - clearly
>> > warrants improvement. I've prepared a patch working for Python3 as well.
>>
>> Great!
>>
This is an invalid ticket because the docstring clearly states, in
three different yet critical places, that missing values are not
handled here:

"Each row in the text file must have the same number of values."
"genfromtxt : Load data with missing values handled as specified."
"This function aims to be a fast reader for simply formatted files. The
`genfromtxt` function provides more sophisticated handling of, e.g.,
lines with missing values."

Really, I am trying to separate the usage of loadtxt and genfromtxt to
avoid unnecessary duplication and confusion. Part of this is
historical, because loadtxt was added in 2007 and genfromtxt was added
in 2009. So certain features of loadtxt have been 'kept' for backwards
compatibility, yet these features can be 'abused' to handle missing
data. But I really consider that any missing values should cause
loadtxt to fail.
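
As a rough illustration of that split in responsibilities, here is a
minimal sketch of reading a row with an empty field through genfromtxt.
The sample data is made up for this example, and the result relies on
genfromtxt's default of filling missing float fields with nan:

```python
import io

import numpy as np

# Hypothetical tab-delimited input: the second row has an empty middle field.
text = "1\t2\t3\n4\t\t6\n"

# genfromtxt keeps the empty field as a column and, for float output,
# fills it with nan by default.
data = np.genfromtxt(io.StringIO(text), delimiter="\t")

print(data.shape)
print(data)
```

No converter is needed here, which is the point: the missing-value
handling lives in genfromtxt rather than being bolted onto loadtxt.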

The patch is incorrect because it should not include a space in the
split(), as indicated in the comment by the original reporter. Of
course, a corrected patch alone is still not sufficient to address the
problem unless the user provides the correct converter. Also, you
start to run into problems with multiple delimiters (such as one space
versus two spaces), so you start down the path of adding all the
features that duplicate genfromtxt.
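
To make the delimiter problem concrete, here is a small sketch using
only plain Python string methods. It is not the patch and not loadtxt's
actual code; the sample line and variable names are invented for the
illustration, and the converter is the one quoted from the ticket
discussion:

```python
# Hypothetical row with an empty second field and an empty last field.
line = "1\t\t3\t\n"

# No argument: any run of whitespace is one separator, so empty fields vanish.
collapsed = line.split()

# Explicit tab, stripping only the newline: every empty field is preserved.
preserved = line.rstrip("\n").split("\t")

# Stripping all whitespace first removes the trailing tab as well, so the
# empty field at the end of the line is lost before the split happens.
truncated = line.strip().split("\t")

# Converter of the kind quoted above: an empty string becomes nan.
to_float = lambda s: float(s or "nan")
values = [to_float(field) for field in preserved]

print(collapsed)   # ['1', '3']
print(preserved)   # ['1', '', '3', '']
print(truncated)   # ['1', '', '3']
```

Only the second variant hands the converter one field per column, and
even then the user still has to supply that converter.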


Bruce

