[Numpy-discussion] np.loadtxt : yet a new implementation...
Mon Dec 1 15:23:18 CST 2008
Stéfan van der Walt wrote:
> Hi Pierre
> 2008/12/1 Pierre GM <email@example.com>:
>> * `genloadtxt` is the base function that does all the work. It
>> outputs 2 arrays, one for the data (missing values being substituted
>> by the appropriate default) and one for the mask. It would go in
> I see the code length increased from 200 lines to 800. This made me
> wonder about the execution time: initial benchmarks suggest a 3x
> slow-down. Could this be a problem for loading large text files? If
> so, should we consider keeping both versions around, or by default
> bypassing all the extra hooks?
I've wondered about this being an issue. On one hand, you hate to make
existing code noticeably slower. On the other hand, if speed is
important to you, why are you using ASCII I/O?
I personally am not entirely against having two versions of loadtxt-like
functions. However, the idea seems a little odd, given that loadtxt
was already supposed to be the "Swiss army knife" of text reading.
I'm seeing a similar slowdown with Pierre's version of the code. The
version of loadtxt that I cobbled together with the StringConverter
class (and no missing value support) shows about a 50% slowdown, so
clearly there's a performance penalty for trying to make a generic
function that can be all things to all people. On the other hand, this
approach reduces code duplication.
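For anyone who wants to measure this kind of overhead themselves, here is a
minimal timing sketch. It assumes a current NumPy, where `np.genfromtxt`
stands in for the more generic reader and `np.loadtxt` for the simpler one.
The synthetic data, array size, and repeat counts are arbitrary choices, and
the ratio it prints will not match the figures quoted in this thread.

```python
import io
import timeit

import numpy as np

# Build a synthetic, purely numeric text file in memory (sizes are arbitrary).
rng = np.random.default_rng(0)
values = rng.random((2000, 5))
buf = io.StringIO()
np.savetxt(buf, values)
text = buf.getvalue()


def read_simple():
    # Simpler reader: no missing-value handling.
    return np.loadtxt(io.StringIO(text))


def read_generic():
    # More generic reader: supports missing values, names, type detection.
    return np.genfromtxt(io.StringIO(text))


# Take the best of a few repeats to reduce noise from other processes.
t_simple = min(timeit.repeat(read_simple, number=3, repeat=3))
t_generic = min(timeit.repeat(read_generic, number=3, repeat=3))

print(f"loadtxt:    {t_simple:.4f} s")
print(f"genfromtxt: {t_generic:.4f} s")
print(f"ratio:      {t_generic / t_simple:.2f}x")
```

Both functions should return the same array for this input, so any difference
in the timings reflects the cost of the extra hooks rather than the data.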
I'm not really opinionated on what the right approach is here. My only
opinion is that this functionality *really* needs to be in numpy in some
fashion. For my own use case, with the old version, I had to read a text
file and then separate out columns and mask values by hand. Now, I open a
file and get a structured array with an automatically detected dtype
(names and types!) plus masked values.
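To illustrate that workflow, here is a small sketch. It uses `np.genfromtxt`
from current NumPy as a stand-in, on the assumption that it behaves like the
function discussed in this thread. The sample data and column names are made
up for the example.

```python
import io

import numpy as np

# Hypothetical sample file: a header row, mixed column types,
# and one empty field in the "temp" column.
text = io.StringIO(
    "station,temp,pressure\n"
    "OUN,21.5,1013\n"
    "TLX,,1009\n"
    "FWD,25.1,1011\n"
)

data = np.genfromtxt(
    text,
    delimiter=",",
    names=True,        # take field names from the first row
    dtype=None,        # detect each column's type from its contents
    encoding="utf-8",  # read string columns as str rather than bytes
    usemask=True,      # return a masked array; missing fields are masked
)

print(data.dtype)          # structured dtype with one field per column
print(data["temp"])        # the empty field shows up as masked
print(data["temp"].mask)   # per-row mask for the "temp" column
```

The result is a single masked structured array, so columns are selected by
name and the missing temperature is masked rather than needing to be handled
by hand.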
Graduate Research Assistant
School of Meteorology
University of Oklahoma