[Numpy-discussion] in the NA discussion, what can we agree on?
Nathaniel Smith
njs@pobox....
Fri Nov 4 16:29:03 CDT 2011
On Fri, Nov 4, 2011 at 1:22 PM, T J <tjhnson@gmail.com> wrote:
> I agree that it would be ideal if the default were to skip IGNORED values,
> but that behavior seems inconsistent with its propagation properties (such
> as when adding arrays with IGNORED values). To illustrate, when we did
> "x+2", we were stating that:
>
> IGNORED(2) + 2 == IGNORED(4)
>
> which means that we propagated the IGNORED value. If we were to skip them
> by default, then we'd have:
>
> IGNORED(2) + 2 == 2
>
> To be consistent, then it seems we also should have had:
>
> >>> x + 2
> [3, 2, 5]
>
> which I think we can agree is not so desirable. What this seems to come
> down to is that we tend to want different behavior when we are doing
> reductions, and that for IGNORED data, we want it to propagate in every
> situation except for a reduction (where we want to skip over it).
>
> I don't know if there is a well-defined way to distinguish reductions from
> the other operations. Would it hold for generalized ufuncs? Would it hold
> for other functions which might return arrays instead of scalars?
Continuing my theme of looking for consensus first... there are
obviously a ton of ugly corners in here. But my impression is that at
least for some simple cases, it's clear what users want:
>>> a = [1, IGNORED(2), 3]
# array-with-ignored-values + unignored scalar only affects unignored values
>>> a + 2
[3, IGNORED(2), 5]
# reduction operations skip ignored values
>>> np.sum(a)
4
For example, Gary mentioned the common idiom of wanting to take an
array and subtract off its mean, and he wants to do that while leaving
the masked-out/ignored values unchanged. As long as the above cases
work the way I wrote, we will have
>>> np.mean(a)
2
>>> a -= np.mean(a)
>>> a
[-1, IGNORED(2), 1]
Which I'm pretty sure is the result that he wants. (Gary, is that
right?) Also numpy.ma follows these rules, so that's some additional
evidence that they're reasonable. (And I think part of the confusion
between Lluís and me was that these are the rules that I meant when I
said "non-propagating", but he understood that to mean something
else.)
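
For anyone who wants to check the numpy.ma claim directly, here is a minimal sketch. It assumes a masked element is a fair stand-in for the IGNORED(2) notation used above (IGNORED itself is just notation for this thread, not an existing API). The variable names are mine, and I use float data so that the mean is exact. It only looks at the mask and the unmasked values, since what numpy.ma stores underneath a masked slot is an implementation detail.

```python
import numpy as np
import numpy.ma as ma

# The masked middle element plays the role of IGNORED(2) in the examples above.
a = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# Elementwise op with an unmasked scalar: only the unmasked values change,
# and the masked slot stays masked.
b = a + 2

# Reductions skip the masked element.
total = a.sum()    # 1 + 3 = 4.0
mean = a.mean()    # (1 + 3) / 2 = 2.0

# Gary's idiom: subtract off the mean, leaving the masked slot masked.
centered = a - a.mean()

print(b)
print(total, mean)
print(centered)
```

If this runs the way I expect, the unmasked entries of b are 3 and 5, the unmasked entries of centered are -1 and 1, and the middle slot is still masked in both.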
So before we start exploring the whole vast space of possible ways to
handle masked-out data, does anyone see any reason to consider rules
that don't have, as a subset, the ones above? Do other rules have any
use cases or user demand? (I *love* playing with clever mathematics
and making things consistent, but there's not much point unless the
end result is something that people will use :-).)
-- Nathaniel