[Numpy-discussion] np.loadtxt : yet a new implementation...

Ryan May rmay31@gmail....
Mon Dec 1 15:23:18 CST 2008


Stéfan van der Walt wrote:
> Hi Pierre
> 
> 2008/12/1 Pierre GM <pgmdevlist@gmail.com>:
>> * `genloadtxt` is the base function that does all the work. It
>> outputs 2 arrays, one for the data (missing values being substituted
>> by the appropriate default) and one for the mask. It would go in
>> np.lib.io
> 
> I see the code length increased from 200 lines to 800.  This made me
> wonder about the execution time: initial benchmarks suggest a 3x
> slow-down.  Could this be a problem for loading large text files?  If
> so, should we consider keeping both versions around, or by default
> bypassing all the extra hooks?

I've wondered about this being an issue.  On one hand, you hate to make 
existing code noticeably slower.  On the other hand, if speed is 
important to you, why are you using ASCII I/O?

I personally am not entirely against having two versions of loadtxt-like 
functions.  However, the idea seems a little odd, seeing as how loadtxt 
was already supposed to be the "Swiss Army knife" of text reading.

I'm seeing a similar slowdown with Pierre's version of the code.  The 
version of loadtxt that I cobbled together with the StringConverter 
class (and no missing value support) shows about a 50% slowdown, so 
clearly there's a performance penalty for trying to make a generic 
function that can be all things to all people.  On the other hand, this 
approach reduces code duplication.
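
For anyone who wants to measure the trade-off on their own data, here is a
rough timing sketch. It assumes a later NumPy release in which np.genfromtxt
serves as the more general reader; the synthetic table and the helper names
read_simple and read_generic are made up for illustration, and the ratio you
see will depend heavily on the NumPy version and the file contents.

```python
import io
import timeit

import numpy as np

# Build a synthetic 3-column text table in memory (hypothetical data).
rows = 2000
text = "\n".join(f"{i} {i * 0.5} {i * 2}" for i in range(rows))


def read_simple():
    # Plain reader: no missing-value handling.
    return np.loadtxt(io.StringIO(text))


def read_generic():
    # More general reader: extra hooks for types and missing values.
    return np.genfromtxt(io.StringIO(text))


# Time a handful of repetitions of each reader on the same input.
t_simple = timeit.timeit(read_simple, number=5)
t_generic = timeit.timeit(read_generic, number=5)

print(f"loadtxt:    {t_simple:.4f} s")
print(f"genfromtxt: {t_generic:.4f} s")
print(f"ratio:      {t_generic / t_simple:.2f}x")
```

Both readers return the same float array here, so any difference in the
timings is the cost of the extra generality rather than of different output.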

I'm not really opinionated on what the right approach is here.  My only 
opinion is that this functionality *really* needs to be in numpy in some 
fashion.  With the old version, my own use case meant reading a text
file and then separating out the columns and masking values by hand.
Now, I open a file and get a structured array with an automatically
detected dtype (names and types!) plus masked values.
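
To make that concrete, here is a minimal sketch of the kind of call I mean.
It assumes a later NumPy release where np.genfromtxt provides this behavior
(it is not the exact code under discussion in this thread), and the station
data and the "N/A" missing-value marker are invented for the example.

```python
import io

import numpy as np

# Hypothetical station data; "N/A" marks a missing reading.
text = """station,temp,pressure
KOUN,21.5,1012
KTLX,N/A,1009
KOKC,19.0,1011
"""

data = np.genfromtxt(
    io.StringIO(text),
    delimiter=",",
    names=True,            # take field names from the header line
    dtype=None,            # guess each column's type from its contents
    encoding="utf-8",
    missing_values="N/A",  # strings to treat as missing
    usemask=True,          # return a masked array instead of filling
)

print(data.dtype.names)   # field names taken from the header
print(data["temp"])       # the missing reading shows up as masked
print(data["temp"].mask)  # boolean mask for the temp column
```

No manual column splitting or masking is needed: the names come from the
header, the types from the contents, and the mask from the missing-value
marker.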

My $0.02.

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma

