[Numpy-discussion] Concepts for masked/missing data
Sat Jun 25 13:06:14 CDT 2011
On Sat, Jun 25, 2011 at 11:26 AM, Matthew Brett <firstname.lastname@example.org> wrote:
> On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith <email@example.com> wrote:
> > So obviously there's a lot of interest in this question, but I'm
> > losing track of all the different issues that've been raised in the
> > 150-post thread of doom. I think I'll find this easier if we start by
> > putting aside the questions about implementation and such and focus
> > for now on the *conceptual model* that we want. Maybe I'm not the only
> > one?
> > So as far as I can tell, there are three different ways of thinking
> > about masked/missing data that people have been using in the other
> > thread:
> > 1) Missingness is part of the data. Some data is missing, some isn't,
> > this might change through computation on the data (just like some data
> > might change from a 3 to a 6 when we apply some transformation, NA |
> > True could be True, instead of NA), but we can't just "decide" that
> > some data is no longer missing. It makes no sense to ask what value is
> > "really" there underneath the missingness. And it's critical that we
> > keep track of this through all operations, because otherwise we may
> > silently give incorrect answers -- exactly like it's critical that we
> > keep track of the difference between 3 and 6.
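The "NA | True could be True" remark in 1) can be sketched in plain
Python. This is only an illustration of three-valued logic: NA is
modelled as None, and na_or / na_and are hypothetical helper names, not
NumPy functions.

```python
# NA is modelled with None; this is an illustrative stand-in, not a NumPy API.
NA = None

def na_or(a, b):
    """Three-valued OR over True, False and NA."""
    if a is True or b is True:
        return True   # one known True decides the result
    if a is NA or b is NA:
        return NA     # otherwise an unknown operand keeps the result unknown
    return False

def na_and(a, b):
    """Three-valued AND over True, False and NA."""
    if a is False or b is False:
        return False  # one known False decides the result
    if a is NA or b is NA:
        return NA
    return True

print(na_or(NA, True), na_or(NA, False), na_and(NA, False), na_and(NA, True))
```

Missingness only goes away here when the known operand settles the
answer by itself; nothing ever "decides" that the NA input has a value.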
> So far I see the difference between 1) and 2) being that you cannot
> unmask. So, if you didn't even know you could unmask data, then it
> would not matter that 1) was being implemented by masks?
Yes, bingo, you hit it right on the nose. Essentially, 1) could be
considered the "hard mask", while 2) would be the "soft mask". Everything
else is implementation details.
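For what it's worth, the existing numpy.ma module already has both
behaviours, so the distinction can be tried out today. A minimal sketch,
assuming the hard_mask keyword of np.ma.array; the variable names are
only for illustration:

```python
import numpy as np

data = [1.0, 2.0, 3.0]
mask = [False, True, False]

soft = np.ma.array(data, mask=mask)                  # default: soft mask
hard = np.ma.array(data, mask=mask, hard_mask=True)  # hard mask

# Assigning to a masked element unmasks it under a soft mask,
# but leaves it masked under a hard mask.
soft[1] = 99.0
hard[1] = 99.0

soft_masked = bool(np.ma.getmaskarray(soft)[1])
hard_masked = bool(np.ma.getmaskarray(hard)[1])
print(soft_masked, hard_masked)
print(soft.sum(), hard.sum())
```

With the soft mask the element becomes visible again and takes part in
the sum; with the hard mask it stays masked and is skipped.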
> > 2) All the data exists, at least in some sense, but we don't always
> > want to look at all of it. We lay a mask over our data to view and
> > manipulate only parts of it at a time. We might want to use different
> > masks at different times, mutate the mask as we go, etc. The most
> > important thing is to provide convenient ways to do complex
> > manipulations -- preserve masks through indexing operations, overlay
> > the mask from one array on top of another array, etc. When it comes to
> > other sorts of operations then we'd rather just silently skip the
> > masked values -- we know there are values that are masked, that's the
> > whole point, to work with the unmasked subset of the data, so if sum
> > returned NA then that would just be a stupid hassle.
> To clarify, you're proposing for:
> a = np.sum(np.array([np.NA, np.NA]))
> 1) -> np.NA
> 2) -> 0.0
Actually, I have always considered this to be a bug. Note that "np.sum([])"
also returns 0.0. I think the reason it returns zero instead of NaN is that
there is no NaN equivalent for integers. This is where I think an np.NA
could best serve NumPy: by providing a dtype-agnostic way to represent
missing or invalid data.
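A small sketch of the behaviour I mean, using only what NumPy has now.
The np.NA from this thread is a proposal and is not used here; numpy.ma
stands in for it, and the variable names are only illustrative.

```python
import numpy as np

# The sum of an empty array is 0.0, not NaN.
empty_sum = np.sum(np.array([]))

# An integer array has no slot for NaN: the assignment is rejected.
ints = np.array([1, 2, 3])
try:
    ints[0] = np.nan
    int_accepts_nan = True
except ValueError:
    int_accepts_nan = False

# A mask marks integer entries as missing without needing a sentinel value.
masked_ints = np.ma.array([1, 2, 3], mask=[True, False, False])
partial_sum = masked_ints.sum()       # skips the masked entry

# When every entry is masked, numpy.ma returns the masked constant.
all_masked_sum = np.ma.array([1, 2], mask=[True, True]).sum()

print(empty_sum, int_accepts_nan, partial_sum, all_masked_sum)
```

So numpy.ma already distinguishes "nothing to add" from "everything is
missing", which is the distinction a plain 0.0 result throws away.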