[Numpy-discussion] Array concatenation performance
Christopher Barker
Chris.Barker@noaa....
Fri Jul 16 14:33:44 CDT 2010
Sturla Molden wrote:
> Christopher Barker skrev:
>> However, Python lists hold python objects, so it's bit inefficient, at
>> least in terms of memory use.
>>
> I guess that is mainly in terms of memory use, as the Python (scalar)
> objects must be created for the call to append. np.array([]) can also be
> inefficient, as Anne explained yesterday, but an appendable ndarray
> would not have that problem. But I don't know how important it is to
> save a few milliseconds worth of cpu time for this.
not very.
> At least not enough
> to convince me to spend my time on it. The memory issue is more
> important though.
I agree.
> But arrays with O(1) amortized append are memory
> inefficient on their own.
well, yes, but by the amount that you buffer them -- I've seen reference
to methods that allocate twice as much memory as needed on each
expansion, so that would be a maximum of a 2X hit.
In poking around the python code, it looks like lists and array.arrays
over allocate by 1/16 (once the arrays are large, anyway). In my code,
It defaults to over allocating by 1/4. In my very limited testing, I
didn't see a performance gain by allocating more than that. So it's a
maximum of 25% memory hit -- and it's user configurable.
> So it might make more sence to save a list of
> equally long short buffers (e.g. ndarrays), and flatten to a contigous
> buffer when needed.
I did that in a C extension I wrote a while back for reading big text
files. I found that the size of the buffer hardly mattered to performance.
In my newer code, I found it a lot easier to over-allocate, and even if
you over allocate by 1/10 or 1/16, performance is still pretty good.
> Personally I am not convinced about an appendable array, neither speed
> nor memory wise, I just wanted to hear what other people think.
I still think it's a good idea -- though clearly not important enough
for me to have gone further with my code (though I am using it now).
For me, the reasons are memory and the ability to deal natively with
numpy data types that don't map directly to Python types. I think there
could be a real benefit to loadtxt() for instance.
> there is one more thing that makes me sceptical: We cannot use realloc()
> to grow an appendable array. That is because it would segfault if there
> are array views in use (the views might suddenly get a dangling data
> pointer). And prohibiting realloc(), the performance of an appendable
> array might not be fantastic
I used ndarray.resize, which I'm pretty sure uses realloc(). And that's
why I didn't subclass ndarray, but rather have an ndarray as the buffer,
which never gets a view on it. When slicing, a copy is returned.
But you're right this limits the utility, as it's not a first class
citizen of numpy -- you still use it like you do list -- accumulate in
it, then convert to a regular numpy array for further processing.
I'd love to see someone with more C chops than me pick this up -- but
Sturla's right -- it's hard to find the motivation for the relatively
small gains.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
More information about the NumPy-Discussion
mailing list