[Numpy-discussion] How to set array values based on a condition?

Damian Eads eads@soe.ucsc....
Mon Mar 24 02:09:42 CDT 2008


Hi,

I am eager to implement the C version of the set_where function but 
would like to do so in a numpy-esque way. Having implemented several 
internal and released Python/C packages, I am familiar with the PyArray 
object and the PyArrayIterObject and the like. After looking through the 
code I noticed the @typ@ notation in the .src files, which looks like 
some kind of pseudo-template to allow ndarray methods to work over 
arbitrary data types. I surmise these files are fed through a shell or 
Python script to generate the actual code? In my own code, I've been 
using C++ very carefully to handle such generality but I understand that 
numpy is strictly written in C.

The set_where function is very similar to ndarray.__add__ in that its 
arguments can either be scalars or arrays (all arrays must be of the 
same shape, though). To get a sense of how a numpy operator is 
implemented can someone point me to the templatized array __add__ 
function? I tried searching through the files below (numpy/core/src) but 
I can't quite pin down where the add functionality is implemented. 
Seeing PyArray_ArgMax, I thought PyArray_XXX might have been the naming 
convention for numpy methods but the fact that PyArray_Add (or 
PyArray_Plus) is not defined makes me unsure.

arraymethods.c      multiarraymodule.c      _sortmodule.c.src
arrayobject.c       scalarmathmodule.c.src  ucsnarrow.c
arraytypes.inc.src  scalartypes.inc.src     ufuncobject.c
_isnan.c            _signbit.c              umathmodule.c.src

Please advise. Thank you.

Damian

-----------------------------------------------------
Damian Eads                             Ph.D. Student
Jack Baskin School of Engineering, UCSC        E2-381
1156 High Street
Santa Cruz, CA 95064    http://www.soe.ucsc.edu/~eads

Damian Eads wrote:
> Damian Eads wrote:
>> Anne Archibald wrote:
>>> On 23/03/2008, Damian Eads <eads@soe.ucsc.edu> wrote:
>>>> Hi,
>>>>
>>>>  I am working on a memory-intensive experiment with very large arrays so
>>>>  I must be careful when allocating memory. Numpy already supports a
>>>>  number of in-place operations (+=, *=) making the task much more
>>>>  manageable. However, it is not obvious to me out I set values based on a
>>>>  very simple condition.
>>>>
>>>>  The expression
>>>>
>>>>    y[y<0]=-1
>>>>
>>>>  generates a binary index mask y>=0 of the same size as the array y,
>>>>  which is problematic when y is quite large.
>>>>
>>>>  I was wondering if there was anything like a set_where(A, cmp, B,
>>>>  setval, [optional elseval]) function where cmp would be a comparison
>>>>  operator expressed as a string.
>>>>
>>>>  The code below illustrates what I want to do. Admittedly, it needs to be
>>>>  cleaned up but it's a proof of concept. Does numpy provide any functions
>>>>  that support the functionality of the code below?
>>> That's a good question, but I'm pretty sure it doesn't, apart from
>>> numpy.clip(). The way I'd try to solve that problem would be with the
>>> dreaded for loop. Don't iterate over single elements, but if you have
>>> a gargantuan array, working in chunks of ten thousand (or whatever)
>>> won't have too much overhead:
>>>
>>> block = 100000
>>> for n in arange(0,len(y),block):
>>>     yc = y[n:n+block]
>>>     yc[yc<0] = -1
>>>
>>> It's a bit of a pain, but working with arrays that nearly fill RAM
>>> *is* a pain, as I'm sure you are all too aware by now.
>>>
>>> You might look into numexpr, this is the sort of thing it does (though
>>> I've never used it and can't say whether it can do this).
>>>
>>> Anne
>>> _______________________________________________
>>> Numpy-discussion mailing list
>>> Numpy-discussion@scipy.org
>>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>> Hi Anne,
>>
>> Since the thing I want to do is a common case, I figured that if I were 
>> to take the blocked-based approach, I'd write a helper function to do 
>> the blocking for me. Here it is:
>>
>> import numpy
>> import types
>>
>> def block_cond(*args):
>>      """
>>      block_cond(X1, ..., XN, cond_fun, val_fun, [else_fun])
>>
>>      Breaks the 1-D arrays X1 to XN into properly aligned chunks. The
>>      cond_fun is a function that takes in the chunks of each array
>>      returns an index or mask array. For each chunk c
>>
>>         C=cond_fun(X1[c], ..., XN[c])
>>
>>      The val_fun takes the masked or indexed chunks, and returns the
>>      values each element should be set to
>>
>>         V=cond_fun(X1[c][C], ..., XN[c][C])
>>
>>      Finally, the first array's elements
>>
>>         X1[c][C]=V
>>      """
>>      blksize = 100000
>>      if len(args) < 3:
>>          raise ValueError("Nothing to do.")
>>
>>      if type(args[-3]) == types.FunctionType:
>>          elsefn = args[-1]
>>          valfn = args[-2]
>>          condfn = args[-3]
>>          qargs = args[:-3]
>>      else:
>>          elsefn = None
>>          valfn = args[-1]
>>          condfn = args[-2]
>>          qargs = args[:-2]
>>
>>      # Grab the length of the first array.
>>      num = qargs[0].size
>>      shp = qargs[0].shape
>>
>>      # Check that rest of the arguments are all arrays of the same size.
>>      for i in xrange(0, len(qargs)):
>>          if type(qargs[i]) != _array_type:
>>              raise TypeError("Argument %i must be an array." % i)
>>          if qargs[i].size != num:
>>              raise TypeError("Array argument %i differs in size from the 
>> previous arrays." % i)
>>          if qargs[i].shape != shp:
>>              raise TypeError("Array argument %i differs in shape from 
>> the previous arrays." % i)
>>
>>      for a in xrange(0, num, blksize):
>>          b = min(a + blksize, num)
>>          fargs = [qarg[a:b] for qarg in qargs]
>>          c = apply(condfn, fargs)
>>          #print c
>>          v = apply(valfn, [farg[c] for farg in fargs])
>>          #print v
>>          slc = qargs[0][a:b]
>>          slc[c] = v
>>          if elsefn is not None:
>>              ev = apply(elsefn, [numpy.array(arg[a:b])[~c] for arg in 
>> qargs])
>>              slc[~c] = ev
>>
>> -----------------------------
>>
>> Let's try running it,
>>
>> In [96]: y=numpy.random.rand(10000000)
>>
>> In [97]: x=y.copy()
>>
>> In [98]: %time x[:] = x<=0.5
>> CPU times: user 0.39 s, sys: 0.01 s, total: 0.40 s
>> Wall time: 0.66 s
>>
>> In [100]: %time setwhere.block_cond(y, lambda y: y <= 0.5, lambda y: 1, 
>> lambda y: 0)
>> CPU times: user 1.70 s, sys: 0.10 s, total: 1.80 s
>> Wall time: 2.28 s
>>
>> The inefficient copying approach is almost 4 times faster than the 
>> blocking approach. Ideas about what I'm doing wrong?
>>
>> Would others find a proper C-based numpy implementation of the set_where 
>> function useful? I'd offer to implement it.
>>
>> Damian
> 
> If I try it with the scipy.weave implementation I showed in my first 
> posting of this thread, I get a factor of 3 speed up over the 
> memory-inefficient copy approach and a factor of 10 speed up over the 
> block-based approach.
> 
> In [105]: y=numpy.random.rand(10000000)
> 
> In [106]: %time setwhere.set_where(y, "<=", 0.5, 1, 0)
> CPU times: user 0.15 s, sys: 0.00 s, total: 0.15 s
> Wall time: 0.21 s
> 
> This suggests a C implementation might be worth it.
> 
> Damian
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion



More information about the Numpy-discussion mailing list