[Numpy-discussion] load from text files Pull Request Review

Derek Homeier derek@astro.physik.uni-goettingen...
Fri Sep 2 11:42:07 CDT 2011


On 02.09.2011, at 6:16PM, Christopher Jordan-Squire wrote:

> I hadn't thought of that. Interesting idea. I'm surprised that
> completely resetting the array could be faster.
> 
I had experimented a bit with the fromiter function, which also grows 
the output array as needed, and this adds negligible overhead compared
to parsing the text input. It is implemented in C, though, so I don't know how 
the .resize() calls would compare to that, and unfortunately it works for 1D arrays 
only.
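
A minimal sketch of the fromiter approach, for illustration only. The
helper iter_floats, the sample text and the known column count are
assumptions made for this example and are not taken from the pull request.
Since fromiter only builds 1D arrays, the result is reshaped afterwards.

```python
import io
import numpy as np

def iter_floats(lines, delimiter=","):
    """Yield every field of every line as a float, one value at a time."""
    for line in lines:
        for field in line.split(delimiter):
            yield float(field)  # float() tolerates the trailing newline

# Hypothetical in-memory stand-in for a text file
text = io.StringIO("1.0,2.0,3.0\n4.0,5.0,6.0\n")
ncols = 3  # assumed to be known in advance for this sketch

# Without a count argument, fromiter grows its 1D output as values arrive
flat = np.fromiter(iter_floats(text), dtype=float)

# Recover the 2D layout after the fact
data = flat.reshape(-1, ncols)
```

If the number of values is known beforehand, passing count= to fromiter
lets it allocate the output once instead of growing it.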

>> In my tests, I'm pretty sure that the time spent on file I/O and string
>> parsing swamps the time it takes to allocate memory and set the values.
> 
> In my tests, at least for a medium-sized CSV file (about 3000 rows by
> 30 columns), about 10% of the time was spent determining the types in the
> first read-through and 90% of the time was spent sticking the data in the
> array.
> 
This would be consistent with my experience (basically testing for comment 
characters and the length of line.split(delimiter) in the first pass). 
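
As a rough sketch of what such a first pass might look like (the function
name scan_columns, its defaults and the sample input are hypothetical, and
the actual code under review may differ):

```python
import io

def scan_columns(lines, delimiter=",", comments="#"):
    """First pass: count data rows and check the number of fields per row."""
    nrows = 0
    ncols = None
    for line in lines:
        # Drop everything after the comment character, then strip whitespace
        line = line.split(comments, 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        n = len(line.split(delimiter))
        if ncols is None:
            ncols = n  # the first data row fixes the expected width
        elif n != ncols:
            raise ValueError(
                f"data row {nrows + 1} has {n} fields, expected {ncols}"
            )
        nrows += 1
    return nrows, ncols

# Hypothetical sample with a header comment, a blank line and a trailing comment
sample = io.StringIO("# header comment\n1,2,3\n\n4,5,6  # trailing\n")
nrows, ncols = scan_columns(sample)
```

With nrows and ncols in hand, the output array can be allocated once
before the second pass fills it.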

> However, that particular test took more time for reading in because
> the data was quoted (so converting '"3,25"' to a float took between
> 1.5x and 2x as long as '3.25') and the datetime conversion is costly.
> 
> Regardless, that suggests making the data loading faster is more
> important than avoiding reading through the file twice. I guess that
> intuition probably breaks if the data doesn't fit into memory,
> though. But I haven't worked with extremely large data files before,
> so I'd appreciate refutation/confirmation of my priors.
> 
The lion's share of the data loading time, in my experience, is still the string 
operations (like the comma conversion you quote above), so I'd always 
expect any subsequent manipulations of the numpy array data to be very fast
compared to that. Maybe this changes slightly with more complex data types like 
string records or datetime instances, but as you indicate, even for those the 
conversion seems to dominate the cost. 
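
To illustrate the kind of string operation meant here, a small sketch of a
converter for quoted decimal-comma fields, with a crude timing comparison.
Both function names are made up for this example, and the measured ratio
will vary with machine and Python version.

```python
import timeit

def parse_plain(field):
    """Convert a plain field like '3.25' to a float."""
    return float(field)

def parse_quoted_decimal_comma(field):
    """Convert a field like '"3,25"' to 3.25: strip quotes, comma -> point."""
    return float(field.strip('"').replace(",", "."))

value = parse_quoted_decimal_comma('"3,25"')

# Rough timings in seconds for 100000 conversions each
t_plain = timeit.timeit(lambda: parse_plain("3.25"), number=100_000)
t_quoted = timeit.timeit(
    lambda: parse_quoted_decimal_comma('"3,25"'), number=100_000
)
```

Comparing t_quoted with t_plain gives a feel for how much the extra string
handling adds per field.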

Cheers,
						Derek
--
----------------------------------------------------------------
Derek Homeier          Centre de Recherche Astrophysique de Lyon
ENS Lyon                                      46, Allée d'Italie
69364 Lyon Cedex 07, France                  +33 1133 47272-8894
----------------------------------------------------------------