[SciPy-user] How to start with SciPy and NumPy

David Cournapeau cournape@gmail....
Mon Jan 26 09:11:35 CST 2009


On Mon, Jan 26, 2009 at 11:52 PM, Sturla Molden <sturla@molden.no> wrote:
>> Sturla Molden wrote:
>
>> This is independent of the list implementation, isn't it? I am quite
>> curious to understand how you could get O(1) complexity for a "growable"
>> container: if you don't know in advance the number of items, and you add
>> O(N) items, how come you can get O(1) complexity?
>
> Each time the array is re-sized, you add in some extra empty slots. Make
> sure the number of extra slots is proportional to the size of the array.

So this is the well-known method of over-allocating when you need to
grow, right? This is not constant time, and depends on the number of
items you are adding, I think (sublinearly, but still).
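Something like the following sketch, I mean (the class and the growth
factor are made up for the example; this is not how CPython lists or
numpy actually resize, it just shows proportional over-allocation):

import numpy as np

class GrowableArray:
    """Toy growable container that over-allocates proportionally."""

    def __init__(self, dtype=float):
        self._buf = np.empty(4, dtype=dtype)   # allocated capacity
        self._n = 0                            # number of used slots

    def append(self, value):
        if self._n == self._buf.shape[0]:
            # Grow geometrically: the copy is O(current size), but it
            # happens rarely enough that the cost per append is O(1)
            # amortized over a long sequence of appends.
            new_buf = np.empty(2 * self._buf.shape[0], dtype=self._buf.dtype)
            new_buf[:self._n] = self._buf
            self._buf = new_buf
        self._buf[self._n] = value
        self._n += 1

    def __array__(self):
        return self._buf[:self._n]

ga = GrowableArray()
for i in range(1000):
    ga.append(i)
print(np.asarray(ga)[:5])   # [0. 1. 2. 3. 4.]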

>
>> This may be true for a one-dimensional array, but generally, I think numpy
>> array performance comes a lot from any item being reachable directly
>> from its 'coordinates' (this plus using native types instead of python
>> objects of course).
>
> An ndarray's data is a "one-dimensional array" of bytes.

Yes, it is one segment of memory, but there is more to it than that:
the interpretation of that data buffer (through the dtype). The fact
that each item has exactly the same size, and that the array is not
"ragged", has a big impact on performance as well, I think.
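To make that concrete, here is a toy example of what a fixed itemsize
buys you: in a contiguous 1-d array, item i sits at byte offset
i * itemsize, so you can read it straight out of the raw buffer (going
through tobytes/frombuffer only for the sake of illustration):

import numpy as np

a = np.arange(10, dtype=np.float64)
raw = a.tobytes()          # the underlying bytes of the array

i = 7
start = i * a.itemsize     # no per-item lookup needed, just arithmetic
item = np.frombuffer(raw[start:start + a.itemsize], dtype=a.dtype)[0]
print(item, a[i])          # 7.0 7.0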

>
> If you jump back and forth using arr[i,j] the speed will depend on the
> size of the array. If it is too big to fit in cache this may be slow, if it
> is small enough this will be fast.

Yes, but that's not the only factor: the dependence on the array's
size is only a limitation of the hardware. From a number-of-operations
POV, the coordinates are the only thing needed: a[i, j, k, ...] is
translated directly into an offset into the 1-d buffer using the
strides info. This is really specific to arrays, I would say.
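For example (the variable names are just for the illustration, and the
frombuffer copy is only there to make the flat buffer visible):

import numpy as np

a = np.arange(12, dtype=np.int64).reshape(3, 4)
i, j = 2, 1

# a[i, j] maps straight to a byte offset via the strides,
# with no intermediate lookups.
offset = i * a.strides[0] + j * a.strides[1]
flat = np.frombuffer(a.tobytes(), dtype=a.dtype)
print(flat[offset // a.itemsize], a[i, j])   # 9 9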

Cache is obviously significant, but this is a consequence of the lack
of multiple indirection, itself a consequence of the array being a set
of homogeneous items. If the items all have different sizes, you can't
know directly how many bytes to jump, and you need an indirection,
which breaks the locality of your data. I think that's how array-based
lists work: the array holds the addresses of the items, whereas for
arrays, the address is the item.
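A rough way to see the difference (the exact numbers are
implementation-dependent, only the structure matters here):

import sys
import numpy as np

values = list(range(1000))
arr = np.arange(1000, dtype=np.int64)

# The list's own buffer holds pointers; the integer objects live
# elsewhere, each with its own object overhead.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
# The array's buffer holds the 8-byte integers themselves, contiguously.
array_bytes = arr.nbytes

print(list_bytes, array_bytes)   # e.g. ~36000 vs 8000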

cheers,

David
