[Numpy-discussion] reading *big* inhomogenous text matrices *fast*?
Wed Aug 13 21:42:51 CDT 2008
On Wed, Aug 13, 2008 at 21:07, Daniel Lenski <email@example.com> wrote:
> On Wed, 13 Aug 2008 20:55:02 -0500, robert.kern wrote:
>>> This is similar to what I tried originally! Unfortunately, repeatedly
>>> appending to a list seems to be very slow... I guess Python keeps
>>> reallocating and copying the list as it grows. (It would be nice to be
>>> able to tune the increments by which the list size increases.)
>> The list reallocation schedule is actually fairly well-tuned as it is.
>> Appending to a list object should be amortized O(1) time.
> Hi Robert,
> This is very interesting! Do you know what exactly is the allocation
> strategy of list.append, or have a reference?
Here is the appropriate snippet in Objects/listobject.c:
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6) + newsize;
Raymond Hettinger had a good talk at PyCon this year about the details
of the Python containers. Here are the slides from the EuroPython
version (I assume).
>>> Repeatedly concatenating arrays with numpy.append or numpy.concatenate
>>> is also quite slow, unfortunately. :-(
>> Yes. There is no preallocation here.
> Makes sense. I'm sure this cuts down on the amount of metadata that has
> to be carried along with each array instance, and there aren't too many
> applications for it.
Primarily, it's the fact that we have views of arrays that might be
floating around that prevents us from reallocating as a matter of
course. Now, we do have a .resize() method which will explicitly
reallocate the array, but it will only work if you don't have any
views on the array floating around. During your file reading, this is
probably valid, so you may want to give it a try using a similar
reallocation strategy as lists. I'd be interested in seeing some
benchmarks comparing this strategy with the others.
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
More information about the Numpy-discussion