[Numpy-discussion] in the NA discussion, what can we agree on?
Wed Nov 2 18:37:56 CDT 2011
Okay, here's my attempt at an *uncontroversial* email!
Specifically, I think it'll be easier to talk about this NA stuff if
we can establish some common ground, and easier for people to follow
if the basic points of agreement are laid out in one place. So I'm
going to try and summarize just the things that we can agree about.
Note that right now I'm *only* talking about what kind of tools we
want to give the user -- i.e., what kind of problems we are trying to
solve. AFAICT we don't have as much consensus on implementation
matters, and anyway it's hard to make implementation decisions without
knowing what we're trying to accomplish.
1) I think we have consensus that there are (at least) two different
possible ways of thinking about this problem, with somewhat different
constituencies. Let's call these two concepts "MISSING data" and
"IGNORED data".
2) I also think we have at least a rough consensus on what these
concepts mean, and what their supporters want from them:
MISSING data:
- Conceptually, MISSINGness acts like a property of a datum --
assigning MISSING to a location is like assigning any other value to
that location
- Ufuncs and other operations must propagate these values by default,
and there must be an option to cause them to be ignored
- Must be competitive with NaNs in terms of speed and memory usage (or
else people will just use NaNs)
- Compatibility with R is valuable
- To avoid user confusion, ideally it should *not* be possible to
'unmask' a missing value, since this is inconsistent with the "missing
value" metaphor (e.g., see Wes's comment about "leaky abstractions")
- Possible useful extension: having different classes of missing
values (similar to Stata)
- Target audience: data analysis with missing data, neuroimaging,
econometrics, former R users, ...
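To make the MISSING semantics above a bit more concrete, here is a rough
sketch that uses a plain floating-point NaN as a stand-in for a MISSING
value. That stand-in is an assumption made purely for illustration: NaN only
works for float arrays, and nothing here is a proposed NA API. It just shows
"propagate by default, with an option to ignore":

```python
import numpy as np

# NaN is only a stand-in for MISSING here (illustrative assumption).
a = np.array([1.0, np.nan, 3.0])

# Default behaviour: the missing value propagates through operations.
doubled = a * 2        # the middle element stays missing
total = np.sum(a)      # the whole reduction becomes missing

# Opt-in behaviour: ask explicitly for the missing value to be skipped.
total_skipping = np.nansum(a)

# Assigning MISSING looks like assigning any other value to a location.
a[0] = np.nan
```

Note that once a[0] has been overwritten there is no way to get the old
value back, which matches the "no unmasking" point above.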
IGNORED data:
- Conceptually, IGNOREDness acts like a property of the array --
toggling a location to be IGNORED is kind of vaguely similar to
changing an array's shape
- Ufuncs and other operations must ignore these values by default, and
there doesn't really need to be a way to propagate them, even as an
option (though it probably wouldn't hurt either)
- Some memory overhead is inevitable and acceptable
- Compatibility with R neither possible nor valuable
- Ability to toggle the IGNORED state of a location is critical, and
should be as convenient as possible
- Possible useful extension: having not just different types of
ignored values, but richer ways to combine them -- e.g., the example
of combining astronomical images with some kind of associated
per-pixel quality scores, where one might want the 'mask' to be not
just a boolean IGNORED/not-IGNORED flag, but an integer (perhaps a
multi-byte integer) or even a float, and to allow these 'masks' to be
combined in some more complex way than just logical_and.
- Target audience: anyone who's already doing this kind of thing by
hand using a second mask array + boolean indexing, former numpy.ma
users, matplotlib, ...
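For comparison, here is a rough sketch of the IGNORED semantics using tools
that exist today: a hand-rolled second mask array with boolean indexing, and
numpy.ma. The array contents, the quality scores, and the 0.5 threshold are
all made up for illustration, and taking the per-pixel minimum is just one
possible way of combining quality "masks", not a recommendation:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# By hand: a second boolean array marks which locations are IGNORED.
ignored = np.array([False, True, False, False])
mean_by_hand = data[~ignored].mean()

# numpy.ma: the mask travels with the array, and reductions skip
# masked entries by default.
m = np.ma.masked_array(data, mask=ignored.copy())
mean_masked = m.mean()

# Toggling IGNORED off again: the underlying value was never lost.
m.mask = np.array([False, False, False, False])
mean_unmasked = m.mean()

# Richer "masks": per-pixel quality scores (hypothetical values) combined
# with something other than logical_and, here the per-pixel minimum.
quality_a = np.array([0.9, 0.2, 0.8, 0.7])
quality_b = np.array([0.5, 0.9, 0.1, 0.7])
combined = np.minimum(quality_a, quality_b)
good = combined >= 0.5   # hypothetical threshold
```

The contrast with the MISSING case is that unmasking recovers the original
2.0, and that skipping is the default rather than something you ask for.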
3) And perhaps we can all agree that the biggest *un*resolved question
is whether we want to:
- emphasize the similarities between these two use cases and build a
single interface that can handle both concepts, with some compromises
- or, treat these as two mostly-separate features that can each become
exactly what the respective constituency wants without compromise --
but with some potential redundancy and extra code.
Each approach has advantages and disadvantages.
Does that seem like a fair summary? Anything more we can add? Most
importantly, anything here that you disagree with? Did I summarize
your needs well? Do you have a use case that you feel doesn't fit
naturally into either category?
[Also, I thought this might make the start of a good wiki page for
people to reference during these discussions, but I don't seem to have
edit rights. If other people agree, maybe someone could put it up, or
give me access? My trac id is email@example.com.]