[Numpy-discussion] How to set array values based on a condition?

Damian Eads eads@soe.ucsc....
Sun Mar 23 16:42:46 CDT 2008


Anne Archibald wrote:
> On 23/03/2008, Damian Eads <eads@soe.ucsc.edu> wrote:
>> Hi,
>>
>>  I am working on a memory-intensive experiment with very large arrays so
>>  I must be careful when allocating memory. Numpy already supports a
>>  number of in-place operations (+=, *=) making the task much more
>>  manageable. However, it is not obvious to me out I set values based on a
>>  very simple condition.
>>
>>  The expression
>>
>>    y[y<0]=-1
>>
>>  generates a binary index mask y>=0 of the same size as the array y,
>>  which is problematic when y is quite large.
>>
>>  I was wondering if there was anything like a set_where(A, cmp, B,
>>  setval, [optional elseval]) function where cmp would be a comparison
>>  operator expressed as a string.
>>
>>  The code below illustrates what I want to do. Admittedly, it needs to be
>>  cleaned up but it's a proof of concept. Does numpy provide any functions
>>  that support the functionality of the code below?
> 
> That's a good question, but I'm pretty sure it doesn't, apart from
> numpy.clip(). The way I'd try to solve that problem would be with the
> dreaded for loop. Don't iterate over single elements, but if you have
> a gargantuan array, working in chunks of ten thousand (or whatever)
> won't have too much overhead:
> 
> block = 100000
> for n in arange(0,len(y),block):
>     yc = y[n:n+block]
>     yc[yc<0] = -1
> 
> It's a bit of a pain, but working with arrays that nearly fill RAM
> *is* a pain, as I'm sure you are all too aware by now.
> 
> You might look into numexpr, this is the sort of thing it does (though
> I've never used it and can't say whether it can do this).
> 
> Anne
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion

Hi Anne,

Since the thing I want to do is a common case, I figured that if I were 
to take the blocked-based approach, I'd write a helper function to do 
the blocking for me. Here it is:

import numpy
import types

def block_cond(*args):
     """
     block_cond(X1, ..., XN, cond_fun, val_fun, [else_fun])

     Breaks the 1-D arrays X1 to XN into properly aligned chunks. The
     cond_fun is a function that takes in the chunks of each array
     returns an index or mask array. For each chunk c

        C=cond_fun(X1[c], ..., XN[c])

     The val_fun takes the masked or indexed chunks, and returns the
     values each element should be set to

        V=cond_fun(X1[c][C], ..., XN[c][C])

     Finally, the first array's elements

        X1[c][C]=V
     """
     blksize = 100000
     if len(args) < 3:
         raise ValueError("Nothing to do.")

     if type(args[-3]) == types.FunctionType:
         elsefn = args[-1]
         valfn = args[-2]
         condfn = args[-3]
         qargs = args[:-3]
     else:
         elsefn = None
         valfn = args[-1]
         condfn = args[-2]
         qargs = args[:-2]

     # Grab the length of the first array.
     num = qargs[0].size
     shp = qargs[0].shape

     # Check that rest of the arguments are all arrays of the same size.
     for i in xrange(0, len(qargs)):
         if type(qargs[i]) != _array_type:
             raise TypeError("Argument %i must be an array." % i)
         if qargs[i].size != num:
             raise TypeError("Array argument %i differs in size from the 
previous arrays." % i)
         if qargs[i].shape != shp:
             raise TypeError("Array argument %i differs in shape from 
the previous arrays." % i)

     for a in xrange(0, num, blksize):
         b = min(a + blksize, num)
         fargs = [qarg[a:b] for qarg in qargs]
         c = apply(condfn, fargs)
         #print c
         v = apply(valfn, [farg[c] for farg in fargs])
         #print v
         slc = qargs[0][a:b]
         slc[c] = v
         if elsefn is not None:
             ev = apply(elsefn, [numpy.array(arg[a:b])[~c] for arg in 
qargs])
             slc[~c] = ev

-----------------------------

Let's try running it,

In [96]: y=numpy.random.rand(10000000)

In [97]: x=y.copy()

In [98]: %time x[:] = x<=0.5
CPU times: user 0.39 s, sys: 0.01 s, total: 0.40 s
Wall time: 0.66 s

In [100]: %time setwhere.block_cond(y, lambda y: y <= 0.5, lambda y: 1, 
lambda y: 0)
CPU times: user 1.70 s, sys: 0.10 s, total: 1.80 s
Wall time: 2.28 s

The inefficient copying approach is almost 4 times faster than the 
blocking approach. Ideas about what I'm doing wrong?

Would others find a proper C-based numpy implementation of the set_where 
function useful? I'd offer to implement it.

Damian


More information about the Numpy-discussion mailing list