[Numpy-discussion] load from text files Pull Request Review

Derek Homeier derek@astro.physik.uni-goettingen...
Fri Sep 2 11:17:41 CDT 2011


On 02.09.2011, at 5:50PM, Chris.Barker wrote:

> hmmm -- it seems you could just as well be building the array as you go, 
> and if you hit a change in the input, re-set and start again.
> 
> In my tests, I'm pretty sure that the time spent on file I/O and string 
> parsing swamps the time it takes to allocate memory and set the values.
> 
> So there is little cost, and for the common use case, it would be faster 
> and cleaner.
> 
> There is a chance, of course, that you might have to re-wind and start 
> over more than once, but I suspect that that is the rare case.
> 
I still haven't studied your class in detail, but one could probably just 
create a copy of the array read in so far, e.g. changing it from 
dtype=[('f0', '<i8'), ('f1', '<f8')] to dtype=[('f0', '<f8'), ('f1', '<f8')] as required, 
or even first implement it as a list or dict of arrays that could be changed 
individually, and only create a record array from that at the end. 
The required copying and extra memory use would definitely pale compared 
to the text parsing or the current memory usage for the input list. 
In my loadtxt version [https://github.com/numpy/numpy/pull/144] just parsing 
the text for comment lines adds ca. 10% to the run time, while any of the array 
allocation and copying operations should be at the 1% level at most.
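
To make the two options concrete, here is a rough sketch (not code from the 
pull request; the sample values and variable names are made up for 
illustration). It first copies a partially read structured array into a 
wider dtype, then shows the per-column variant where only one column is 
changed and the record array is built at the end:

```python
import numpy as np

# Rows parsed so far, with column f0 assumed to hold integers.
# (Sample values are invented for this illustration.)
old_dtype = np.dtype([('f0', '<i8'), ('f1', '<f8')])
partial = np.array([(1, 0.5), (2, 1.5)], dtype=old_dtype)

# Option 1: a later line shows that f0 needs floats, so copy
# everything read so far into the wider dtype.
new_dtype = np.dtype([('f0', '<f8'), ('f1', '<f8')])
upgraded = partial.astype(new_dtype)

# Option 2: keep one array per column, change only the column
# that needs it, and build the record array at the very end.
columns = {
    'f0': np.array([1, 2], dtype='<i8'),
    'f1': np.array([0.5, 1.5], dtype='<f8'),
}
columns['f0'] = columns['f0'].astype('<f8')
rec = np.rec.fromarrays(list(columns.values()), names=list(columns.keys()))
```

In both cases the copy touches only numeric data already in memory, which 
is why it should be cheap next to the text parsing.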
> 
>> enable automatic decompression (given the modularity, could you simply
>> use np.lib._datasource.open() like genfromtxt?)
> 
> I _think_ this would benefit from a one-pass solution as well -- so you 
> don't need to de-compress twice.

Absolutely; on compressed data the overhead of the extra pass jumps to 30-50%.
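
A minimal sketch of what a one-pass reader could look like, so that a 
compressed file is only decompressed once. This is an illustration, not the 
code from the pull request: load_single_pass is a hypothetical helper, it 
uses the standard gzip module rather than np.lib._datasource, and it assumes 
whitespace-separated numeric columns with the same number of fields on 
every line:

```python
import gzip
import io
import numpy as np

def load_single_pass(lines, comments='#'):
    """Hypothetical one-pass reader: widen int columns to float on demand."""
    columns = None   # one Python list per column
    is_int = None    # per-column flag: only integers seen so far?
    for line in lines:
        line = line.split(comments, 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        if columns is None:
            columns = [[] for _ in fields]
            is_int = [True] * len(fields)
        for i, text in enumerate(fields):
            if is_int[i]:
                try:
                    columns[i].append(int(text))
                    continue
                except ValueError:
                    # Change in the input: convert what was read so far
                    # instead of rewinding and re-reading the file.
                    is_int[i] = False
                    columns[i] = [float(v) for v in columns[i]]
            columns[i].append(float(text))
    if columns is None:
        raise ValueError("no data found")
    arrays = [np.array(col, dtype='i8' if flag else 'f8')
              for col, flag in zip(columns, is_int)]
    names = ['f%d' % i for i in range(len(arrays))]
    return np.rec.fromarrays(arrays, names=names)

# Demo on an in-memory gzip stream (sample data invented for illustration).
raw = b"# demo\n1 0.5\n2 1.5\n2.5 3.0\n"
with gzip.open(io.BytesIO(gzip.compress(raw)), 'rt') as fh:
    data = load_single_pass(fh)
```

Here the third data line forces column f0 from integer to float, and the 
stream is still read from start to end exactly once.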

Cheers,
							Derek
--
----------------------------------------------------------------
Derek Homeier          Centre de Recherche Astrophysique de Lyon
ENS Lyon                                      46, Allée d'Italie
69364 Lyon Cedex 07, France                  +33 1133 47272-8894
----------------------------------------------------------------





