[Numpy-discussion] missing data discussion round 2

Nathaniel Smith njs@pobox....
Tue Jun 28 10:06:02 CDT 2011

On Mon, Jun 27, 2011 at 2:03 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
> On Mon, Jun 27, 2011 at 12:18 PM, Matthew Brett <matthew.brett@gmail.com>
> wrote:
>> You won't get complaints, you'll just lose a group of users, who will,
>> I suspect, stick to NaNs, unsatisfactory as they are.
> This blade cuts both ways, we'd lose a group of users if we don't support
> masking semantics, too.

The problem is, that's inevitable. One might think that trying to find
a compromise solution that picks a few key aspects of each approach
would be a good way to make everyone happy, but in my experience, it
mostly leads to systems that are a muddled mess and that make everyone
unhappy. You're much better off saying screw it, these goals are in
scope and those ones aren't, and we're going to build something
consistent and powerful instead of focusing on how long the feature
list is. That's also the problem with focusing too much on a list of
use cases: you might capture everything on any single list, but there
is actually an infinite variety of use cases that will arise in the
future. If you can generalize beyond the use cases to find some simple
and consistent mental model, and implement that, then that'll work for
all those future use cases too. But sometimes that requires deciding
what *not* to implement.

Just my opinion, but it's fairly hard won.

Anyway, it's pretty clear that in this particular case, there are two
distinct features that different people want: the missing data
feature, and the masked array feature. The more I think about it, the
less I see how they can be combined into one dessert topping + floor
wax solution. Here are three particular points where they seem to
contradict each other:

Missing data: We think memory usage is critical. The ideal solution
has zero overhead. If we can't get that, then at the very least we
want the overhead to be 1 bit/item instead of 1 byte/item.
Masked arrays: We say, it's critical to have good ways to manipulate
the masking array, share it between multiple arrays, and so forth. And
numpy already has great support for all those things! So obviously the
masking array should be exposed as a standard ndarray.
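
To put rough numbers on the overhead, here is a small sketch using only
existing NumPy calls. It is not part of the NEP or of any proposed API;
the array size and the choice of NaN as the NA bit pattern are just
assumptions for illustration.

```python
import numpy as np

n = 1_000_000
data = np.zeros(n, dtype=np.float64)

# Masked-array style: the mask is an ordinary boolean ndarray, so it is
# easy to slice and share, but it costs 1 byte per item.
mask = np.zeros(n, dtype=bool)

# A bit-packed mask costs 1 bit per item, but it is a uint8 array that
# can no longer be indexed element-by-element like a normal ndarray.
packed = np.packbits(mask)

# Bit-pattern style: the NA marker lives inside the data itself (NaN is
# used here as a stand-in), so there is no side array at all.
data[5] = np.nan

print(data.nbytes, mask.nbytes, packed.nbytes)
```

The trade-off is visible in the last two arrays: the packed form is 8x
smaller than the boolean mask, but only the boolean mask behaves like a
standard ndarray.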

Missing data: Once you've assigned NA to a value, you should *not* be
able to get at what was stored there before.
Masked arrays: You must be able to unmask a value and recover what was
stored there before.

(You might think, what difference does it make if you *can* unmask an
item? Us missing data folks could just ignore this feature. But:
whatever we end up implementing is something that I will have to
explain over and over to different people, most of them not
particularly sophisticated programmers. And there's just no sensible
way to explain this idea that if you store some particular value, then
it replaces the old value, but if you store NA, then the old value is
still there. They will get confused, and then store it away as another
example of how computers are arbitrary and confusing and they're just
too dumb to understand them, and I *hate* doing that to people. Plus
the more that happens, the more they end up digging themselves into
some hole by trying things at random, and then I have to dig them out
again. So the point is, we can go either way, but in both ways there
*is* a cost, and we have to decide.)
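
The difference can be seen with what NumPy already ships. The sketch
below uses the existing numpy.ma module for the masked side and a plain
NaN as a stand-in for NA on the missing-data side; neither is the
design under discussion, they just show the two behaviours.

```python
import numpy as np

# Masked-array semantics: masking hides a value without destroying it.
m = np.ma.array([1.0, 2.0, 3.0])
m[1] = np.ma.masked          # hide the middle element
hidden_payload = m.data[1]   # the 2.0 is still sitting in the data buffer
m.mask[1] = False            # unmask it again
recovered = float(m[1])      # the old value comes back

# Missing-data semantics (NaN standing in for NA): assignment overwrites.
a = np.array([1.0, 2.0, 3.0])
a[1] = np.nan                # the 2.0 is gone for good

print(hidden_payload, recovered, a)
```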

Missing data: It's critical that NAs propagate through reduction
operations by default, though there should also be some way to turn
this off.
Masked arrays: Masked values should be silently ignored by reduction
operations, and having to remember to pass a special flag to turn on
this behavior on every single ufunc call would be a huge pain.
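
Again, existing NumPy shows both defaults side by side. This is only an
illustration with NaN standing in for NA and numpy.ma standing in for
the masked feature, not the proposed behaviour itself.

```python
import numpy as np

# Missing-data default: the marker propagates through the reduction.
x = np.array([1.0, np.nan, 3.0])
propagated = np.sum(x)       # NaN
skipped = np.nansum(x)       # skipping has to be requested explicitly

# Masked-array default: masked items are silently ignored.
m = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
masked_sum = m.sum()

print(propagated, skipped, masked_sum)
```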

(Masked array advocates: please correct me if I'm misrepresenting you
anywhere above!)

> That said, Travis favors doing both, so there's a good chance there will be
> time for it.

One issue with the current draft is that it doesn't address how
masking-missing and bit-pattern-missing interact:
  a = np.zeros(10, dtype="NA[f8]")
  a.flags.hasmask = True
  a[5] = np.NA   # Now what?

If you're going to implement both things anyway, and you need to
figure out how they interact anyway, then why not split them up into
two totally separate features?

Here's my proposal:
1) Add purely dtype-based support for missing data:
1.A) Add some flags/metadata to the dtype structure to let it describe
what a missing value looks like for an element of its type. Something
like, an example NA value plus a function that can be called to
identify NAs when they occur in arrays. (Notice that this interface is
general enough to handle both the bit-stealing approach and the
maybe() approach.)
1.B) Add an np.NA object, and teach the various coercion loops to use
the above fields in the dtype structure to handle it.
1.C) Teach the various reduction loops that if a particular flag is
set in the dtype, then they also should check for NAs and handle them
appropriately. (If this flag is not set, then it means that this
dtype's ufunc loops are already NA aware and the generic machinery is
not needed unless skipmissing=True is given. This is useful for
user-defined dtypes, and probably also a nice optimization for floats
using NaN.)
1.D) Finally, as a convenience, add some standard NA-aware dtypes.
Personally, I wouldn't bother with the complicated string-based
mini-language described in the current NEP; just define some standard
NA-enabled dtype objects in the numpy namespace, or provide a function
that takes a dtype + an NA bit-pattern and spits out an NA-enabled
dtype, or whatever.
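
A pure-Python sketch of points 1.A and 1.C may make the shape of this
clearer. Everything named here (NASpec, INT32_NA, FLOAT64_NA, na_sum) is
hypothetical and invented for illustration; the real thing would live
in the C dtype structure and the ufunc loops. Treating every NaN as NA
for floats is also a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass(frozen=True)
class NASpec:
    """Hypothetical stand-in for the per-dtype NA metadata of 1.A."""
    dtype: np.dtype
    na_value: object                            # an example NA value
    is_na: Callable[[np.ndarray], np.ndarray]   # identifies NAs in an array


# Bit-stealing: reserve one int32 bit pattern to mean NA.
_INT32_MIN = np.iinfo(np.int32).min
INT32_NA = NASpec(np.dtype(np.int32), _INT32_MIN, lambda a: a == _INT32_MIN)

# Floats: reuse NaN as the NA pattern.
FLOAT64_NA = NASpec(np.dtype(np.float64), np.nan, np.isnan)


def na_sum(arr, spec, skipmissing=False):
    """Sum that propagates NA by default, as in 1.C."""
    missing = spec.is_na(arr)
    if skipmissing:
        return arr[~missing].sum()
    if missing.any():
        return spec.na_value
    return arr.sum()


a = np.array([1, 2, INT32_NA.na_value], dtype=np.int32)
print(na_sum(a, INT32_NA))                    # the NA bit pattern
print(na_sum(a, INT32_NA, skipmissing=True))  # sum of the non-NA items
```

The point of the sketch is that the reduction only ever talks to the
two fields on the spec, so the same code path serves both the
bit-stealing dtype and the NaN-based one.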

2) Add better masked array support.
2.A) Masked arrays are simply arrays with an extra attribute
'.visible', which is an arbitrary numpy array that is broadcastable to
the same shape as the masked array. There's no magic here -- if you
say a.visible = b.visible, then they now share a visibility array,
according to the ordinary rules of Python assignment. (Well, there
needs to be some check for shape compatibility, but that's not much.)
2.B) To minimize confusion with the missing value support, the way you
mask/unmask items is through expressions like 'a.visible[10] = False';
there is no magic np.masked object. (There are a few options for what
happens when you try to use scalar indexing explicitly to extract an
invisible value -- you could return the actual value from behind the
mask, or throw an error, or return a scalar masked array whose
.visible attribute was a scalar array containing False. I don't know
what the people who actually use this stuff would prefer :-).)
2.C) Indexing and shape-changing operations on the masked array are
automatically applied to the .visible array as well. (Attempting to
call .resize() on an array which is being used as the .visible
attribute of some other array is an error.)
2.D) Ufuncs on masked arrays always ignore invisible items. We can
probably share some code here between the handling of skipmissing=True
for NA-enabled dtypes and invisible items in masked arrays, but that's
purely an implementation detail.
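
Points 2.A through 2.D can be sketched in a few lines of Python. The
VisibleArray class below is hypothetical and is neither numpy.ma nor a
proposed implementation; it wraps an ndarray rather than subclassing
it, and only implements indexing and sum(), to keep the example short.

```python
import numpy as np


class VisibleArray:
    """Hypothetical sketch of an array with a '.visible' attribute."""

    def __init__(self, data, visible=True):
        self.data = np.asarray(data)
        self.visible = visible  # goes through the setter's shape check

    @property
    def visible(self):
        return self._visible

    @visible.setter
    def visible(self, value):
        value = np.asarray(value, dtype=bool)
        # 2.A: raises ValueError unless broadcastable to the data's shape.
        np.broadcast_to(value, self.data.shape)
        # Stored by reference, so two arrays can share one visibility array.
        self._visible = value

    def _expanded(self):
        # Visibility expanded to the full shape of the data.
        if self._visible.shape == self.data.shape:
            return self._visible
        return np.broadcast_to(self._visible, self.data.shape)

    def __getitem__(self, index):
        # 2.C: indexing is applied to the data and to .visible alike.
        return VisibleArray(self.data[index], self._expanded()[index])

    def sum(self):
        # 2.D: reductions always ignore invisible items.
        return self.data[self._expanded()].sum()


a = VisibleArray([1.0, 2.0, 3.0], np.ones(3, dtype=bool))
b = VisibleArray([10.0, 20.0, 30.0])
b.visible = a.visible        # plain assignment: the array is now shared
a.visible[1] = False         # 2.B: mask through .visible, no magic object
print(a.sum(), b.sum())      # both skip index 1
print(a[0:2].sum())          # the slice carries its visibility along
```

Because b.visible is the very same array object as a.visible, hiding
an item through one hides it in the other, which is just the ordinary
Python assignment behaviour described in 2.A.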

This approach to masked arrays requires that the ufunc machinery have
some special knowledge of what a masked array is, so masked arrays
would have to become part of the core. I'm not sure whether or not
they should be part of the np.ndarray base class or remain as a
subclass, though. There's an argument that they're more of a
convenience feature like np.matrix, and code which interfaces between
ndarrays and C becomes more complicated if it has to be prepared to
handle visibility. (Note that in contrast, ndarrays can already
contain arbitrary user-defined dtypes, so the missing value support
proposed here doesn't add any new issues to C interfacing.) So maybe
it'd be better to leave it as a core supported subclass? Could go
either way.

-- Nathaniel
