[Numpy-discussion] missing data discussion round 2
Charles R Harris
Tue Jun 28 11:38:32 CDT 2011
On Tue, Jun 28, 2011 at 9:06 AM, Nathaniel Smith <firstname.lastname@example.org> wrote:
> On Mon, Jun 27, 2011 at 2:03 PM, Mark Wiebe <email@example.com> wrote:
> > On Mon, Jun 27, 2011 at 12:18 PM, Matthew Brett <firstname.lastname@example.org>
> > wrote:
> >> You won't get complaints, you'll just lose a group of users, who will,
> >> I suspect, stick to NaNs, unsatisfactory as they are.
> > This blade cuts both ways, we'd lose a group of users if we don't support
> > masking semantics, too.
> The problem is, that's inevitable. One might think that trying to find
> a compromise solution that picks a few key aspects of each approach
> would be a good way to make everyone happy, but in my experience, it
> mostly leads to systems that are a muddled mess and that make everyone
> unhappy. You're much better off saying screw it, these goals are in
> scope and those ones aren't, and we're going to build something
> consistent and powerful instead of focusing on how long the feature
> list is. That's also the problem with focusing too much on a list of
> use cases: you might capture everything on any single list, but there
> are actually an infinite variety of use cases that will arise in the
> future. If you can generalize beyond the use cases to find some simple
> and consistent mental model, and implement that, then that'll work for
> all those future use cases too. But sometimes that requires deciding
> what *not* to implement.
> Just my opinion, but it's fairly hard won.
> Anyway, it's pretty clear that in this particular case, there are two
> distinct features that different people want: the missing data
> feature, and the masked array feature. The more I think about it, the
> less I see how they can be combined into one dessert topping + floor
> wax solution. Here are three particular points where they seem to
> contradict each other:
> Missing data: We think memory usage is critical. The ideal solution
> has zero overhead. If we can't get that, then at the very least we
> want the overhead to be 1 bit/item instead of 1 byte/item.
> Masked arrays: We say, it's critical to have good ways to manipulate
> the masking array, share it between multiple arrays, and so forth. And
> numpy already has great support for all those things! So obviously the
> masking array should be exposed as a standard ndarray.
> Missing data: Once you've assigned NA to a value, you should *not* be
> able to get at what was stored there before.
> Masked arrays: You must be able to unmask a value and recover what was
> stored there before.
> (You might think, what difference does it make if you *can* unmask an
> item? Us missing data folks could just ignore this feature. But:
> whatever we end up implementing is something that I will have to
> explain over and over to different people, most of them not
> particularly sophisticated programmers. And there's just no sensible
> way to explain this idea that if you store some particular value, then
> it replaces the old value, but if you store NA, then the old value is
> still there. They will get confused, and then store it away as another
> example of how computers are arbitrary and confusing and they're just
> too dumb to understand them, and I *hate* doing that to people. Plus
> the more that happens, the more they end up digging themselves into
> some hole by trying things at random, and then I have to dig them out
> again. So the point is, we can go either way, but in both ways there
> *is* a cost, and we have to decide.)
> Missing data: It's critical that NAs propagate through reduction
> operations by default, though there should also be some way to turn
> this off.
> Masked arrays: Masked values should be silently ignored by reduction
> operations, and having to remember to pass a special flag to turn on
> this behavior on every single ufunc call would be a huge pain.
> (Masked array advocates: please correct me if I'm misrepresenting you
> anywhere above!)
> > That said, Travis favors doing both, so there's a good chance there will
> > be time for it.
> One issue with the current draft is that I don't see any addressing of
> how masking-missing and bit-pattern-missing interact:
> a = np.zeros(10, dtype="NA[f8]")
> a.flags.hasmask = True
> a[0] = np.NA # Now what?
> If you're going to implement both things anyway, and you need to
> figure out how they interact anyway, then why not split them up into
> two totally separate features?
> Here's my proposal:
> 1) Add a purely dtype-based support for missing data:
> 1.A) Add some flags/metadata to the dtype structure to let it describe
> what a missing value looks like for an element of its type. Something
> like, an example NA value plus a function that can be called to
> identify NAs when they occur in arrays. (Notice that this interface is
> general enough to handle both the bit-stealing approach and the
> maybe() approach.)
> 1.B) Add an np.NA object, and teach the various coercion loops to use
> the above fields in the dtype structure to handle it.
> 1.C) Teach the various reduction loops that if a particular flag is
> set in the dtype, then they also should check for NAs and handle them
> appropriately. (If this flag is not set, then it means that this
> dtype's ufunc loops are already NA aware and the generic machinery is
> not needed unless skipmissing=True is given. This is useful for
> user-defined dtypes, and probably also a nice optimization for floats
> using NaN.)
> 1.D) Finally, as a convenience, add some standard NA-aware dtypes.
> Personally, I wouldn't bother with the complicated string-based
> mini-language described in the current NEP; just define some standard
> NA-enabled dtype objects in the numpy namespace or provide a function
> that takes a dtype + a NA bit-pattern and spits out an NA-enabled
> dtype or whatever.
> 2) Add better masked array support.
> 2.A) Masked arrays are simply arrays with an extra attribute
> '.visible', which is an arbitrary numpy array that is broadcastable to
> the same shape as the masked array. There's no magic here -- if you
> say a.visible = b.visible, then they now share a visibility array,
> according to the ordinary rules of Python assignment. (Well, there
> needs to be some check for shape compatibility, but that's not much
> 2.B) To minimize confusion with the missing value support, the way you
> mask/unmask items is through expressions like 'a.visible = False';
> there is no magic np.masked object. (There are a few options for what
> happens when you try to use scalar indexing explicitly to extract an
> invisible value -- you could return the actual value from behind the
> mask, or throw an error, or return a scalar masked array whose
> .visible attribute was a scalar array containing False. I don't know
> what the people who actually use this stuff would prefer :-).)
> 2.C) Indexing and shape-changing operations on the masked array are
> automatically applied to the .visible array as well. (Attempting to
> call .resize() on an array which is being used as the .visible
> attribute of some other array is an error.)
> 2.D) Ufuncs on masked arrays always ignore invisible items. We can
> probably share some code here between the handling of skipmissing=True
> for NA-enabled dtypes and invisible items in masked arrays, but that's
> purely an implementation detail.
> This approach to masked arrays requires that the ufunc machinery have
> some special knowledge of what a masked array is, so masked arrays
> would have to become part of the core. I'm not sure whether or not
> they should be part of the np.ndarray base class or remain as a
> subclass, though. There's an argument that they're more of a
> convenience feature like np.matrix, and code which interfaces between
> ndarrays and C becomes more complicated if it has to be prepared to
> handle visibility. (Note that in contrast, ndarrays can already
> contain arbitrary user-defined dtypes, so the missing value support
> proposed here doesn't add any new issues to C interfacing.) So maybe
> it'd be better to leave it as a core supported subclass? Could go
> either way.
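For reference, the two sets of semantics contrasted in the quoted message can be approximated with what NumPy already ships. This is only a rough sketch: it uses NaN as a stand-in for the proposed NA (np.NA and NA-enabled dtypes do not exist yet) and numpy.ma as a stand-in for the proposed '.visible' arrays, so the variable names and the mapping onto the proposal are my assumptions.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Missing-data style, with NaN standing in for NA (assumption):
# storing the marker overwrites the old value, and reductions propagate it.
na_style = data.copy()
na_style[1] = np.nan
propagated = na_style.sum()      # the missing value poisons the sum
skipped = np.nansum(na_style)    # explicit opt-in to skip missing values

# Masked-array style, with numpy.ma standing in for '.visible' (assumption):
# the value stays behind the mask, and reductions ignore it by default.
masked = np.ma.masked_array(data.copy(), mask=[False, True, False, False])
masked_total = masked.sum()      # masked item silently ignored

# Unmasking recovers what was stored before.
masked.mask = False
recovered = masked[1]

print(propagated, skipped, masked_total, recovered)
```

The point of the sketch is only that the defaults differ: propagation versus skipping in reductions, and destructive versus recoverable assignment.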
Nathaniel, an implementation using masks will look *exactly* like an
implementation using na-dtypes from the user's point of view. Except that
taking a masked view of an unmasked array allows ignoring values without
destroying or copying the original data. The only downside I can see to an
implementation using masks is memory and disk storage, and perhaps memory
mapped arrays. And I rather expect the former to solve itself in a few
years, eight gigs is becoming a baseline for workstations and in a couple of
years I expect that to be up around 16-32, and a few years after that.... In
any case we are talking 12% - 25% overhead, and in practice I expect it
won't be quite as big a problem as folks project.
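A small sketch of both claims, using the existing numpy.ma module rather than the proposed implementation (which does not exist yet). It assumes a one-byte boolean mask per element, which is where figures like 12% - 25% would come from for 8-byte and 4-byte items; the array contents are made up for illustration.

```python
import numpy as np

original = np.arange(6, dtype=np.float64)

# Take a masked view: numpy.ma does not copy the data by default,
# so ignoring values neither destroys nor duplicates the original.
view = np.ma.masked_array(original, mask=(original > 3))
shares = np.shares_memory(view.data, original)
total = view.sum()               # only the visible items 0, 1, 2, 3

# Overhead of a one-byte-per-element mask relative to the data itself.
mask_bytes = np.dtype(np.bool_).itemsize
overhead_f8 = mask_bytes / np.dtype(np.float64).itemsize
overhead_f4 = mask_bytes / np.dtype(np.float32).itemsize

print(shares, total, overhead_f8, overhead_f4)
```

A bit-packed mask would cut these ratios by a factor of eight, at the cost of the mask no longer being an ordinary boolean ndarray.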