[Numpy-discussion] Concepts for masked/missing data

Nathaniel Smith njs@pobox....
Sat Jun 25 11:05:54 CDT 2011


So obviously there's a lot of interest in this question, but I'm
losing track of all the different issues that've been raised in the
150-post thread of doom. I think I'll find this easier if we start by
putting aside the questions about implementation and such and focus
for now on the *conceptual model* that we want. Maybe I'm not the only
one?

So as far as I can tell, there are three different ways of thinking
about masked/missing data that people have been using in the other
thread:

1) Missingness is part of the data. Some data is missing, some isn't,
this might change through computation on the data (just like some data
might change from a 3 to a 6 when we apply some transformation, NA |
True could be True, instead of NA), but we can't just "decide" that
some data is no longer missing. It makes no sense to ask what value is
"really" there underneath the missingness. And it's critical that we
keep track of this through all operations, because otherwise we may
silently give incorrect answers -- exactly like it's critical that we
keep track of the difference between 3 and 6.

2) All the data exists, at least in some sense, but we don't always
want to look at all of it. We lay a mask over our data to view and
manipulate only parts of it at a time. We might want to use different
masks at different times, mutate the mask as we go, etc. The most
important thing is to provide convenient ways to do complex
manipulations -- preserve masks through indexing operations, overlay
the mask from one array on top of another array, etc. When it comes to
other sorts of operations then we'd rather just silently skip the
masked values -- we know there are values that are masked, that's the
whole point, to work with the unmasked subset of the data, so if sum
returned NA then that would just be a stupid hassle.

3) The "all things to all people" approach: implement every feature
implied by either (1) or (2), and switch back and forth between these
conceptual frameworks whenever necessary to make sense of the
resulting code.
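
To make the difference between (1) and (2) concrete, here is a rough
sketch using tools that already exist in numpy. It is only an
approximation: NaN stands in for a "real" NA under (1), and numpy.ma
stands in for the mask-as-view idea under (2). Neither is meant as the
design under discussion, and the variable names are made up for the
example.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Model (1), approximated with NaN: the missing value is part of the
# data, so a reduction over it is itself "missing".
with_missing = data.copy()
with_missing[1] = np.nan
total_missing = with_missing.sum()

# Model (2), using numpy.ma: the value is still there, we have just
# chosen not to look at it, so the reduction skips it.
masked = np.ma.masked_array(data, mask=[False, True, False, False])
total_masked = masked.sum()

print(total_missing)  # propagates the missingness
print(total_masked)   # sums only the unmasked elements 1.0, 3.0, 4.0
```

Under (1) the sum refuses to give a number; under (2) it quietly sums
the unmasked subset.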

The advantage of deciding up front what our model is is that it makes
a lot of other questions easier. E.g., someone asked in the other
thread whether, after setting an array element to NA, it would be
possible to get back the original value. If we follow (1), the answer
is obviously "no", if we follow (2), the answer is obviously "yes",
and if we follow (3), the answer is obviously "yes, probably, well,
maybe you better check the docs?".
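
The "can I get the original value back?" question can be sketched the
same way, again assuming NaN as a stand-in for NA under (1) and
numpy.ma as a stand-in for (2). This illustrates the two answers with
existing tools; it is not a proposal for how the feature should be
implemented.

```python
import numpy as np

original = np.array([10.0, 20.0, 30.0])

# Model (1), approximated with NaN: assigning the missing marker
# overwrites the element, so the old value is gone from this array.
a = original.copy()
a[1] = np.nan
value_still_stored = bool(a[1] == 20.0)

# Model (2), using numpy.ma: masking only hides the element; the
# underlying data is untouched and reappears when the mask is cleared.
m = np.ma.masked_array(original.copy(), mask=[False, True, False])
hidden_value = float(m.data[1])
m.mask = False  # unmask everything
recovered_value = float(m[1])
```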

My personal opinions on these are:
(1): This is a real problem I face, and there isn't any good solution
now. Support for this in numpy would be awesome.
(2): This feels more like a convenience feature to me; we already have
lots of ways to work with subsets of data. I probably wouldn't bother
using it, but that's fine -- I don't use np.matrix either, but some
people like it.
(3): Well, it's a bit of a mess, but I guess it might be better than nothing?

But that's just my opinion. I'm wondering if we can get any consensus
on which of these we actually *want* (or maybe we want some fourth
option!), and *then* we can try to figure out the best way to get
there? Pretty much any implementation strategy we've talked about
could work for any of these, but it's hard to decide between them if we
don't even know what we're trying to do...

-- Nathaniel

