[Numpy-discussion] in the NA discussion, what can we agree on?

Nathaniel Smith njs@pobox....
Fri Nov 4 17:38:48 CDT 2011


On Fri, Nov 4, 2011 at 3:08 PM, T J <tjhnson@gmail.com> wrote:
> On Fri, Nov 4, 2011 at 2:29 PM, Nathaniel Smith <njs@pobox.com> wrote:
>> Continuing my theme of looking for consensus first... there are
>> obviously a ton of ugly corners in here. But my impression is that at
>> least for some simple cases, it's clear what users want:
>>
>> >>> a = [1, IGNORED(2), 3]
>> # array-with-ignored-values + unignored scalar only affects unignored
>> # values
>> >>> a + 2
>> [3, IGNORED(2), 5]
>> # reduction operations skip ignored values
>> >>> np.sum(a)
>> 4
>>
>> For example, Gary mentioned the common idiom of wanting to take an
>> array and subtract off its mean, and he wants to do that while leaving
>> the masked-out/ignored values unchanged. As long as the above cases
>> work the way I wrote, we will have
>>
>> >>> np.mean(a)
>> 2
>> >>> a -= np.mean(a)
>> >>> a
>> [-1, IGNORED(2), 1]
>>
>> Which I'm pretty sure is the result that he wants. (Gary, is that
>> right?) Also numpy.ma follows these rules, so that's some additional
>> evidence that they're reasonable. (And I think part of the confusion
>> between Lluís and me was that these are the rules that I meant when I
>> said "non-propagating", but he understood that to mean something
>> else.)
>>
>> So before we start exploring the whole vast space of possible ways to
>> handle masked-out data, does anyone see any reason to consider rules
>> that don't have, as a subset, the ones above? Do other rules have any
>> use cases or user demand? (I *love* playing with clever mathematics
>> and making things consistent, but there's not much point unless the
>> end result is something that people will use :-).)
>
> I guess I'm just confused about how one would, in principle, distinguish the
> various forms of propagation that you are suggesting (i.e., for reductions).

Well, numpy.ma does work this way, so certainly it's possible to do.
At the code level, np.add() and np.add.reduce() are different entry
points and can behave differently.
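
For reference, here is a minimal sketch of those rules using numpy.ma as it
exists today. The masked array stands in for the proposed IGNORED values
(that is an assumption for illustration; IGNORED itself is not an existing
NumPy feature), and the fill value -999 is only there to make the masked
slot visible when printing.

```python
import numpy.ma as ma

# [1, IGNORED(2), 3] modelled as a masked array with the middle slot masked
a = ma.array([1, 2, 3], mask=[False, True, False])

# Elementwise op with an unmasked scalar: only unmasked slots are affected
shifted = a + 2
print(shifted.filled(-999))   # [3, -999, 5]

# Reductions skip the masked slot
total = a.sum()               # 1 + 3
mean = a.mean()               # (1 + 3) / 2

# Subtracting the mean leaves the masked slot masked
centered = a - mean
print(centered.filled(-999))  # [-1., -999., 1.]
```

Only the unmasked results are relied on here; what numpy.ma stores
underneath a masked slot after an operation is not something to count on.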

OTOH, it might be that it's impossible to do *while still maintaining
other things we care about*... but in that case we should just shake
our fists at the mathematics and then give up, instead of coming up
with an elegant system that isn't actually useful. So that's why I
think we should figure out what's useful first.

> I also don't think it is good that we lack commutativity.  If we disallow
> unignoring, then yes, I agree that what you wrote above is what people
> want.  But if we are allowed to unignore, then I do not.

I *think* that for the no-unignoring (also known as "MISSING") case,
we have a pretty clear consensus that we want something like:

>>> a + 2
[3, MISSING, 5]
>>> np.sum(a)
MISSING
>>> np.sum(a, skip_MISSING=True)
4

(Please say if you disagree, but I really hope you don't!) This case
is also easier, because we don't even have to allow a skip_MISSING
flag in cases where it doesn't make sense (e.g., unary or binary
operations) -- it's a convenience feature, so no-one will care if it
only works when it's useful ;-).
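
A rough way to try these semantics with current NumPy is to let NaN stand
in for MISSING. This is only an analogy: NaN is not the proposed MISSING
value, and skip_MISSING is a flag from this discussion, not an existing
argument. np.nansum plays the role of the skipping reduction.

```python
import numpy as np

# NaN used as a stand-in for MISSING (assumption for illustration only)
a = np.array([1.0, np.nan, 3.0])

shifted = a + 2          # missing slot stays missing: [3., nan, 5.]
total = np.sum(a)        # propagates: nan
skipped = np.nansum(a)   # analogue of skip_MISSING=True: 4.0
```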

The use case that we're still confused about is specifically the one
where people want to *temporarily* hide parts of their data, do some
calculations that ignore those parts of their data, and then unhide
that data again -- e.g., see Gary's first post in this thread. So for
this use case, allowing unignore is definitely important, and having
np.sum() return IGNORED seems pretty useless to me. (When an operation
involves actually missing data, then you need to stop and think what
would be a statistically meaningful way to handle that -- sometimes
it's skip_MISSING, sometimes something else. So np.sum returning
MISSING is useful - it tells you something you might not have
realized. If you just ignored some data because you want to ignore
that data, then having np.sum return IGNORED is useless, because it
tells you something you already knew perfectly well.)
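
The hide / compute / unhide workflow can be sketched with numpy.ma, again
treating a mask as a stand-in for IGNORED (an assumption; the final design
is exactly what is being discussed here).

```python
import numpy as np
import numpy.ma as ma

data = np.array([1.0, 2.0, 3.0])

# Temporarily hide the middle value (ma.array copies the data by default)
view = ma.array(data, mask=[False, True, False])
hidden_sum = view.sum()    # 4.0, the hidden value is skipped

# Unhide: clearing the mask makes the original value visible again
view.mask = ma.nomask
full_sum = view.sum()      # 6.0
```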

> Also, how does something like this get handled?
>
> >>> a = [1, 2, IGNORED(3), NaN]
>
> If I were to say, "What is the mean of 'a'?", then I think most of the time
> people would want 1.5.

I would want NaN! But that's because the only way I get NaNs is when
I do dumb things like compute log(0), and again, I want my code to
tell me that I was dumb instead of just quietly making up a
meaningless answer.
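
Both answers can be reproduced with numpy.ma, which makes the choice
explicit. As before, the mask is only a stand-in for IGNORED; the second
mean is what you get if you additionally decide to ignore the NaN.

```python
import numpy as np
import numpy.ma as ma

# [1, 2, IGNORED(3), NaN] modelled with a mask on the third slot
a = ma.array([1.0, 2.0, 3.0, np.nan], mask=[False, False, True, False])

# Skipping only the ignored slot: the NaN still propagates
mean_nan = a.mean()

# Explicitly choosing to ignore the NaN as well
also_ignore_nan = ma.array(a.data, mask=ma.getmaskarray(a) | np.isnan(a.data))
mean_skip = also_ignore_nan.mean()   # (1 + 2) / 2

print(mean_nan, mean_skip)
```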

-- Nathaniel

