[Numpy-discussion] numarray.where confusion

Francesc Alted falted at pytables.org
Thu May 27 00:47:04 CDT 2004

On Wednesday 26 May 2004 21:01, Perry Greenfield wrote:
> correct. You'd have to break apart the m1 tuple and
> index all the components, e.g.,
> m11, m12 = m1
> x[m11[m2],m12[m2]] = ...
> This gets clumsier with the more dimensions that must
> be handled, but you still can do it. It would be most
> useful if the indexed array is very large, the number
> of items selected is relatively small and one
> doesn't want to incur the memory overhead of all the
> mask arrays of the admittedly much nicer notational
> approach that Francesc illustrated.

Well, boolean arrays use very little memory (only 1 byte per element)
and normally perform quite well when used for indexing.
Some timings:

>>> import timeit
>>> t1 = timeit.Timer("m1=where(x>4);m2=where(x[m1]<7);m11,m12=m1;x[m11[m2],m12[m2]]","from numarray import arange,where;dim=3;x=arange(dim*dim);x.shape=(dim,dim)")
>>> t2 = timeit.Timer("x[(x>4) & (x<7)]","from numarray import arange,where;dim=3;x=arange(dim*dim);x.shape=(dim,dim)")
>>> t1.repeat(3,1000)
[3.1320240497589111, 3.1235389709472656, 3.1198310852050781]
>>> t2.repeat(3,1000)
[1.1218469142913818, 1.117638111114502, 1.1156759262084961]

i.e. using boolean arrays for indexing is roughly 3 times faster.
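
For readers who want to see the two selections side by side, here is a
minimal sketch. It assumes NumPy as a stand-in for numarray (numarray is
what was timed above; NumPy's where() and boolean indexing are assumed to
behave analogously here), so it shows only that both approaches pick the
same elements and says nothing about numarray's speed.

```python
import numpy as np  # assumption: NumPy stands in for numarray

dim = 3
x = np.arange(dim * dim).reshape(dim, dim)

# Approach 1: chained where() calls, splitting the index tuple by hand.
m1 = np.where(x > 4)        # tuple of (row indices, column indices)
m2 = np.where(x[m1] < 7)    # positions within the first selection
m11, m12 = m1
via_where = x[m11[m2], m12[m2]]

# Approach 2: one combined boolean mask.
via_mask = x[(x > 4) & (x < 7)]

print(via_where, via_mask)
```

Both expressions select the elements that are greater than 4 and less
than 7; the first builds several intermediate index arrays, the second
builds boolean masks of the same shape as x.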

For larger arrays this difference is even more noticeable:

>>> t3 = timeit.Timer("m1=where(x>4);m2=where(x[m1]<7);m11,m12=m1;x[m11[m2],m12[m2]]","from numarray import arange,where;dim=1000;x=arange(dim*dim);x.shape=(dim,dim)")
>>> t4 = timeit.Timer("x[(x>4) & (x<7)]","from numarray import arange,where;dim=1000;x=arange(dim*dim);x.shape=(dim,dim)")
>>> t3.repeat(3,10)
[3.1818649768829346, 3.20477294921875, 3.190640926361084]
>>> t4.repeat(3,10)
[0.42328095436096191, 0.42140507698059082, 0.41979002952575684]

As you can see, the difference is now about 7.5 times, approaching an
order of magnitude (!).

So, if the small memory overhead is acceptable, in most cases it is
probably better to use boolean selections. However, it would be nice to
know the underlying reason why this happens, because Perry's approach
intuitively seems as if it should be faster.
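
The memory side of the trade-off can be checked with a small sketch. It
assumes NumPy rather than numarray, so the byte counts are NumPy's and
the integer index type depends on the platform; the variable names are
illustrative only.

```python
import numpy as np  # assumption: NumPy stands in for numarray

dim = 1000
x = np.arange(dim * dim).reshape(dim, dim)

# Boolean mask: one element per element of x, 1 byte each.
mask = (x > 4) & (x < 7)

# Index arrays: one integer array per dimension, one entry per selected item.
rows, cols = np.where(mask)

mask_bytes = mask.nbytes
index_bytes = rows.nbytes + cols.nbytes
print(mask.itemsize, mask_bytes, rows.size, index_bytes)
```

When only a handful of items are selected from a large array, the index
arrays are far smaller than the mask, which is the situation Perry
describes; the mask still costs only one byte per element of x.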

Francesc Alted
