[Numpy-discussion] Boolean arrays

Anne Archibald aarchiba@physics.mcgill...
Fri Aug 27 15:24:05 CDT 2010

On 27 August 2010 16:17, Robert Kern <robert.kern@gmail.com> wrote:
> On Fri, Aug 27, 2010 at 15:10, Ken Watford <kwatford+scipy@gmail.com> wrote:
>> On Fri, Aug 27, 2010 at 3:58 PM, Brett Olsen <brett.olsen@gmail.com> wrote:
>>> Hello,
>>>
>>> I have an array of non-numeric data, and I want to create a boolean
>>> array denoting whether each element in this array is a "valid" value
>>> or not.  This is straightforward if there's only one possible valid
>>> value:
>>>>>> import numpy as N
>>>>>> ar = N.array(("a", "b", "c", "b", "b", "a", "d", "c", "a"))
>>>>>> ar == "a"
>>> array([ True, False, False, False, False,  True, False, False,  True],
>>> dtype=bool)
>>>
>>> If there's multiple possible valid values, I've come up with a couple
>>> possible methods, but they all seem to be inefficient or kludges:
>>>>>> valid = N.array(("a", "c"))
>>>>>> (ar == valid[0]) | (ar == valid[1])
>>> array([ True, False,  True, False, False,  True, False,  True,  True],
>>> dtype=bool)
>>>>>> N.array(map(lambda x: x in valid, ar))
>>> array([ True, False,  True, False, False,  True, False,  True,  True],
>>> dtype=bool)
>>>
>>> Is there a numpy-appropriate way to do this?
>>>
>>> Thanks,
>>> Brett Olsen
>>
>> amap: Like Map, but for arrays.
>>
>>>>> ar = numpy.array(("a", "b", "c", "b", "b", "a", "d", "c", "a"))
>>>>> valid = ('a', 'c')
>>>>> numpy.amap(lambda x: x in valid, ar)
>> array([ True, False,  True, False, False,  True, False,  True,  True],
>> dtype=bool)
>
> I'm not sure what version of numpy this would be in; I've never seen it.
>
> But in any case, that would be very slow for large arrays since it
> would invoke a Python function call for every value in ar. Instead,
> iterate over the valid array, which is much shorter:
>
> mask = numpy.zeros(ar.shape, dtype=bool)  # start with nothing selected
> for good in valid:
>    mask |= (ar == good)
>
> Wrap that up into a function and you're good to go. That's about as
> efficient as it gets unless the valid array gets large.
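
A minimal sketch of Robert's loop wrapped up as a function. The name
mask_any_of is hypothetical (it is not a numpy function), and the mask
is assumed to start out all False:

```python
import numpy as np

def mask_any_of(ar, valid):
    """Return a boolean mask that is True where ar equals any value in valid.

    mask_any_of is a hypothetical helper name, not part of numpy.
    """
    ar = np.asarray(ar)
    mask = np.zeros(ar.shape, dtype=bool)  # start with nothing selected
    for good in valid:
        # One vectorized comparison per valid value, OR-ed into the mask.
        mask |= (ar == good)
    return mask

ar = np.array(["a", "b", "c", "b", "b", "a", "d", "c", "a"])
mask = mask_any_of(ar, ["a", "c"])
```

The Python-level loop runs once per valid value, not once per element of
ar, so the cost stays low as long as valid is short.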

The problem here is really one of how you specify which values are
valid. If your only specification is a Python function, then you're
stuck calling that Python function once for each possible value, no
way around it. But it could happen that you have an array of possible
values and a corresponding boolean array that says whether each one is
valid. Then there's a shortcut that's probably faster than OR-ing the
comparisons together as Robert suggests:

In [3]: A = np.array([1,2,6,4,4,2,1,7,8,2,2,1])

In [4]: B = np.unique(A)

In [5]: B
Out[5]: array([1, 2, 4, 6, 7, 8])

Now let C be a boolean array, the same length as B, that says which of
these distinct values are valid. C could be computed using some sort of
validity function (which it may be possible to vectorize). In any case
it only has to cover the distinct values, and they're sorted (so you
can use ranges).

In [6]: C = np.array([True,True,True,False,False,True])
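
As a rough sketch of the two ways of building C mentioned above: the
rule is_valid and the bounds lo and hi below are made up for
illustration, and only A and B come from the session:

```python
import numpy as np

A = np.array([1, 2, 6, 4, 4, 2, 1, 7, 8, 2, 2, 1])
B = np.unique(A)  # sorted distinct values: [1, 2, 4, 6, 7, 8]

def is_valid(x):
    # Hypothetical validity rule, chosen only for illustration.
    return x in (1, 2, 4, 8)

# Call the Python function once per distinct value, not once per element of A.
C_from_function = np.array([is_valid(x) for x in B], dtype=bool)

# Because B is sorted, "all values between lo and hi inclusive" is a
# contiguous slice of B, found with two binary searches.
lo, hi = 2, 6  # hypothetical bounds
start = np.searchsorted(B, lo, side="left")
stop = np.searchsorted(B, hi, side="right")
C_from_range = np.zeros(B.shape, dtype=bool)
C_from_range[start:stop] = True
```

Either way the expensive part scales with the number of distinct
values, which is often much smaller than the size of A.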

Now to compute validity of A:

In [10]: C[np.searchsorted(B,A)]
Out[10]:
array([ True,  True, False,  True,  True,  True,  True, False,  True,
        True,  True,  True], dtype=bool)
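
Here is the same lookup as a small function, as a sketch only. The name
lookup_validity is hypothetical, and it assumes B is sorted and contains
every value that occurs in A (true when B comes from np.unique(A));
otherwise searchsorted returns insertion points that do not correspond
to matches. The return_inverse and np.isin lines are alternatives I
believe are equivalent here, not what was run above:

```python
import numpy as np

def lookup_validity(A, B, C):
    """Return C looked up at the position of each element of A within B.

    Hypothetical helper. Assumes B is sorted and contains every value in A.
    """
    idx = np.searchsorted(B, A)  # position of each element of A in B
    return C[idx]

A = np.array([1, 2, 6, 4, 4, 2, 1, 7, 8, 2, 2, 1])
B = np.unique(A)
C = np.array([True, True, True, False, False, True])

result = lookup_validity(A, B, C)

# Alternative: np.unique can hand back the positions directly.
B2, inverse = np.unique(A, return_inverse=True)
result_inverse = C[inverse]

# Alternative: membership test against the valid values themselves
# (np.isin is only available in newer numpy releases).
result_isin = np.isin(A, B[C])
```

All three give the same boolean array for this A.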

Anne
