[Numpy-discussion] loadtxt/savetxt tickets

Paul Anton Letnes paul.anton.letnes@gmail....
Sun Mar 27 05:09:18 CDT 2011


On 26. mars 2011, at 21.44, Derek Homeier wrote:

> Hi Paul,
> 
> having had a look at the other tickets you dug up,
> 
>> My opinions are my own, and in detail, they are:
>> 1752:
>>   I attach a possible patch. FWIW, I agree with the request. The  
>> patch is written to be compatible with the fix in ticket #1562, but  
>> I did not test that yet.
> 
> Tested, see also my comments on Trac.

Great!

>> 1731:
>>   This seems like a rather trivial feature enhancement. I attach a  
>> possible patch.
> 
> Agreed. Haven't tested it though.

Great!

>> 1616:
>>   The suggested patch seems reasonable to me, but I do not have a  
>> full list of what objects loadtxt supports today as opposed to what  
>> this patch will support.

Looks like you got this one. Just remember to make it compatible with #1752. Should be easy.

>> 1562:
>>   I attach a possible patch. This could also be the default  
>> behavior to my mind, since the function caller can simply call  
>> numpy.squeeze if needed. Changing default behavior would probably  
>> break old code, however.
> 
> See comments on Trac as well.

Your patch is better, but there is one thing I disagree with.
808    if X.ndim < ndmin:
809        if ndmin == 1:
810            X.shape = (X.size, )
811        elif ndmin == 2:
812            X.shape = (X.size, 1) 
The last line should be:
812            X.shape = (1, X.size) 
If someone wants a 2D array out, they would most likely expect a one-row file to come out as a one-row array, not the other way around, IMHO.
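
To illustrate what I mean - just a rough sketch emulating the proposed
ndmin=2 handling outside of loadtxt, not the actual patch:

import numpy as np
from io import StringIO

X = np.loadtxt(StringIO("1.0 2.0 3.0\n"))  # one-row file -> shape (3,)
if X.ndim < 2:
    X = X.reshape(1, X.size)               # one row in, one row out
print(X.shape)                             # (1, 3)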

>> 1458:
>>   The fix suggested in the ticket seems reasonable, but I have  
>> never used record arrays, so I am not sure  of this.
> 
> There were some issues with Python3, and I also had some general  
> reservations
> as noted on Trac - basically, it makes 'unpack' equivalent to  
> transposing for 2D-arrays,
> but to splitting into fields for 1D-recarrays. My question was, what's  
> going to happen
> when you get to 2D-recarrays? Currently this is not an issue since  
> loadtxt can only
> read 2D regular or 1D structured arrays. But this might change if the  
> data block
> functionality (see below) were to be implemented - data could then be  
> returned as
> 3D arrays or 2D structured arrays... Still, it would probably make  
> most sense (or at
> least give the widest functionality) to have 'unpack=True' always  
> return a list or iterator
> over columns.

OK, I don't know recarrays, as I said.
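
For my own reference, my (possibly naive) picture of the two behaviours,
sketched with plain loadtxt calls - the structured-dtype part is just how
I imagine the proposed field splitting would look, not what loadtxt does now:

import numpy as np
from io import StringIO

# Regular 2D array: unpack=True is effectively a transpose.
x, y = np.loadtxt(StringIO("1 2\n3 4\n"), unpack=True)

# 1D structured array: the proposal amounts to one array per field.
rec = np.loadtxt(StringIO("1 a\n2 b\n"), dtype=[('n', int), ('s', 'S1')])
n, s = rec['n'], rec['s']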

>> 1445:
>>   Adding this functionality could break old code, as some old  
>> datafiles may have empty lines which are now simply ignored. I do  
>> not think the feature is a good idea. It could rather be implemented  
>> as a separate function.
>> 1107:
>>   I do not see the need for this enhancement. In my eyes, the  
>> usecols kwarg does this and more. Perhaps I am misunderstanding  
>> something here.
> 
> Agree about #1445, and the bit about 'usecols' - 'numcols' would just  
> provide a
> shorter call to e.g. read the first 20 columns of a file (well, not  
> even that much
> over 'usecols=range(20)'...), don't think that justifies an extra  
> argument.
> But the 'datablocks' provides something new, that a number of people  
> seem
> to miss from e.g. gnuplot (including me, actually ;-). And it would  
> also satisfy the
> request from #1445 without breaking backwards compatibility.
> I've been wondering if one could instead specify the separator lines  
> through a parameter, e.g. "blocksep=['None', 'blank', 'invalid']" -  
> not sure if that would make it more useful...

What about writing a separate function, e.g. loadblocktxt, that separates the chunks and calls loadtxt for each chunk? Just a thought. Another possibility would be to write a function that lets you load a set of text files in a directory and returns a dict of datasets, one per file. One could write a similar save function, too. They would just need to call loadtxt/savetxt on a per-file basis.
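
Something along these lines is what I have in mind - a toy sketch only,
with the name and the behaviour (splitting on blank lines) made up here
for illustration:

import numpy as np
from io import StringIO

def loadblocktxt(fname, **kwargs):
    """Split a text file on blank lines and run loadtxt on each block."""
    with open(fname) as fh:
        blocks = [b for b in fh.read().split("\n\n") if b.strip()]
    return [np.loadtxt(StringIO(b), **kwargs) for b in blocks]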

>> 1071:
>> 	It is not clear to me whether loadtxt is supposed to support  
>> missing values in the fashion indicated in the ticket.
> 
> In principle it should at least allow you to, by the use of converters  
> as described there.
> The problem is, the default delimiter is described as 'any  
> whitespace', which in the
> present implementation obviously includes any number of blanks or  
> tabs. These
> are therefore treated differently from delimiters like ',' or '&'. I'd  
> reckon there are
> too many people actually relying on this behaviour to silently change it
> (e.g. I know plenty of tables with columns separated by either one or  
> several
> tabs depending on the length of the previous entry). But the tab is  
> apparently also
> treated differently if explicitly specified with "delimiter='\t'" -  
> and in that case using
> a converter à la {2: lambda s: float(s or 'Nan')} is working for  
> fields in the middle of
> the line, but not at the end - clearly warrants improvement. I've  
> prepared a patch
> working for Python3 as well.

Great!
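
For the record, the kind of converter workaround being discussed, assuming
a tab-delimited file with an empty field in the middle of a line (whether
it also works for a trailing empty field is exactly what the patch is about):

import numpy as np
from io import StringIO

data = StringIO("1\t2\t3\n4\t\t6\n")        # second field empty on line 2
arr = np.loadtxt(data, delimiter='\t',
                 converters={1: lambda s: float(s or 'NaN')})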

>> 1163:
>> 1565:
>>   These tickets seem to have the same origin of the problem. I  
>> attach one possible patch. The previously suggested patches that  
>> I've seen will not correctly convert floats to ints, which I believe  
>> my patch will.
> 
> +1, though I am a bit concerned that prompting to raise a ValueError  
> for every
> element could impede performance. I'd probably still enclose it into an
> if issubclass(typ, np.uint64) or issubclass(typ, np.int64):
> just like in npio.patch. I also thought one might switch to  
> int(float128(x)) in that
> case, but at least for the given examples float128 cannot convert with  
> more
> accuracy than float64 (even on PowerPC ;-).
> There were some dissenting opinions that trying to read a float into  
> an int should
> generally throw an exception though...
> 
> And Chuck just beat me...

I am sure someone has been using this functionality to convert floats to ints. Changing it will break their code. I am not sure how big a deal that would be. Also, I am of the opinion that one should _first_ write a program that works _correctly_, and only afterwards worry about performance. Even so, using numpy.int64 would be better, because it would give a sensible error message.
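
For comparison, one explicit workaround that keeps working no matter what
the default ends up doing - just a minimal sketch using converters:

import numpy as np
from io import StringIO

data = StringIO("1.0 2.5\n3.9 4.0\n")
arr = np.loadtxt(data, dtype=np.int64,
                 converters={0: lambda s: int(float(s)),
                             1: lambda s: int(float(s))})
# int(float(s)) truncates, e.g. '2.5' -> 2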

> On 26 Mar 2011, at 21:25, Charles R Harris wrote:
> 
>> I put all these patches together at https://github.com/charris/numpy/tree/loadtxt-savetxt 
>> . Please pull from there to continue work on loadtxt/savetxt so as  
>> to avoid conflicts in the patches. One of the numpy tests is  
>> failing, I assume from patch conflicts, and more tests for the  
>> tickets are needed in any case. Also, new keywords should be added  
>> to the end, not put in the middle of existing keywords.
>> 
>> I haven't reviewed the patches, just tried to get them organized.  
>> Also, I have Derek as the author on all of them, that can be changed  
>> if it is decided the credit should go elsewhere ;) Thanks for the  
>> work you all have been doing on these tickets.
> 
> Thanks, I'll have a look at the new ticket and try to get that  
> organized!

With some luck, all the loadtxt tickets should be closed shortly :-)

Best,
Paul.

