[Numpy-discussion] in the NA discussion, what can we agree on?

Nathaniel Smith njs@pobox....
Wed Nov 2 21:16:39 CDT 2011


Hi Benjamin,

On Wed, Nov 2, 2011 at 5:25 PM, Benjamin Root <ben.root@ou.edu> wrote:
> I want to pare this down even more.  I think the above lists make too many
> unneeded extrapolations.

Okay. I found your formatting a little confusing, so I want to make
sure I understood the changes you're suggesting:

For the description of what MISSING means, you removed the lines:
- Compatibility with R is valuable
- To avoid user confusion, ideally it should *not* be possible to
'unmask' a missing value, since this is inconsistent with the "missing
value" metaphor (e.g., see Wes's comment about "leaky abstractions")

And you added the line:
+ Assigning MISSING is destructive

And for the description of what IGNORED means, you removed the lines:
- Some memory overhead is inevitable and acceptable
- Compatibility with R neither possible nor valuable
- Ability to toggle the IGNORED state of a location is critical, and
should be as convenient as possible

And you added the lines:
+ IGNORE is non-destructive
+ Must be competitive with np.ma for speed and memory (or else users
would just use np.ma)

Is that right?

Assuming it is, my thoughts are:

By R compatibility, I specifically had in mind in-memory
compatibility. rpy2 provides a more-or-less seamless within-process
interface between R and Python (and specifically lets you get numpy
views on arrays returned by R functions), so if we can make this work
for R arrays containing NA too, that'd be handy. (The rpy2 author
requested this in the last discussion here:
http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057084.html)
When it comes to disk formats, this doesn't matter so much, since
IO routines have to translate between different representations all
the time anyway.
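
To make "in-memory compatibility" concrete, here is a rough sketch of
what recognizing R's NA inside a float64 buffer could look like. It
assumes (as far as I know this matches R's own check, but treat it as
an assumption) that R's NA_real_ is a NaN whose low 32 bits carry the
payload 1954. The names R_NA_BITS and is_r_na are made up for
illustration; they are not part of numpy or rpy2.

```python
import numpy as np

# Assumption: R's NA_real_ is a NaN whose low 32 bits hold the payload 1954.
R_NA_BITS = np.uint64(0x7FF00000000007A2)

def is_r_na(values):
    """Hypothetical helper: True where a float64 carries R's NA payload."""
    values = np.ascontiguousarray(values, dtype=np.float64)
    bits = values.view(np.uint64)              # reinterpret the same bytes
    low_word = bits & np.uint64(0xFFFFFFFF)    # keep only the low 32 bits
    return np.isnan(values) & (low_word == np.uint64(1954))

x = np.array([1.0, np.nan, 0.0])
x.view(np.uint64)[2] = R_NA_BITS   # write the NA bit pattern directly
flags = is_r_na(x)                 # the ordinary NaN at index 1 is not flagged
```

The point is only that an ordinary NaN and an R-style NA can be told
apart by looking at the bits, which is what sharing a buffer with R
would require.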

I take my line about MISSING disallowing unmasking and the line you
replaced it with, about MISSING assignment being destructive, as
expressing basically the same idea. Is that fair, or did you mean
something else?
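
For concreteness, here is a minimal sketch of the destructive versus
non-destructive distinction, using NaN as a stand-in for MISSING and
today's np.ma as a stand-in for IGNORED. Neither is the API under
discussion; they are just the closest things available right now.

```python
import numpy as np

# Stand-in for MISSING: overwrite the value with NaN.
# The original 2.0 is gone and cannot be recovered from `a`.
a = np.array([1.0, 2.0, 3.0])
a[1] = np.nan

# Stand-in for IGNORED: mask the value with np.ma.
# The underlying data is kept, so the location can be toggled back.
b = np.ma.array([1.0, 2.0, 3.0], mask=[False, False, False])
b[1] = np.ma.masked          # ignore element 1
ignored_sum = b.sum()        # element 1 is skipped
b.mask[1] = False            # stop ignoring element 1
restored = b[1]              # the original value is still there
```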

Finally, do you think that people who want IGNORED support care about
having a convenient API for masking/unmasking values? You removed that
line, but I don't know if that was because you disagreed with it, or
were just trying to simplify.

> Then, as a third-party module developer, I can tell you that having separate
> and independent ways to detect "MISSING"/"IGNORED" would likely make support
> more difficult and would greatly benefit from a common (or easily
> combinable) method of identification.

Right, sorry... I didn't forget, and that's part of what I was
thinking when I described the second approach as keeping them as
*mostly*-separate interfaces... but I should have made it more
explicit! Anyway, yes:

4) There is consensus that whatever approach is taken, there should be
a quick and convenient way to identify values that are MISSING,
IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED,
is_MISSING_or_IGNORED, or some equivalent.)
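
As a sketch of what point 4 might look like from a third-party
developer's side, again using NaN for MISSING and the np.ma mask for
IGNORED purely as stand-ins (the function names come from the list
above; the implementations are my assumption, not a proposal):

```python
import numpy as np

def is_MISSING(arr):
    # Stand-in: treat NaN in the underlying data as MISSING.
    return np.isnan(np.ma.getdata(arr))

def is_IGNORED(arr):
    # Stand-in: treat the np.ma mask as IGNORED.
    # getmaskarray returns an all-False array for unmasked input.
    return np.ma.getmaskarray(arr)

def is_MISSING_or_IGNORED(arr):
    # The combined check is just the union of the two.
    return is_MISSING(arr) | is_IGNORED(arr)

x = np.ma.array([1.0, np.nan, 3.0, 4.0],
                mask=[False, False, True, False])
missing = is_MISSING(x)
ignored = is_IGNORED(x)
either = is_MISSING_or_IGNORED(x)
```

All three return plain boolean arrays of the same shape as the input,
so they combine with the usual boolean operators.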

-- Nathaniel

