[Numpy-discussion] NA masks in the next numpy release?

Matthew Brett matthew.brett@gmail....
Thu Oct 27 19:31:52 CDT 2011


On Tue, Oct 25, 2011 at 7:56 PM, Travis Oliphant <oliphant@enthought.com> wrote:
> So, I am very interested in making sure I remember the details of the counterproposal.    What I recall is that you wanted to be able to differentiate between a "bit-pattern" mask and a boolean-array mask in the API.   I believe currently even when bit-pattern masks are implemented the difference will be "hidden" from the user on the Python level.
> I am sure to be missing other parts of the discussion as I have been in and out of it.

The ideas

The question that we were addressing in the alter-NEP was: should
missing values implemented as bitpatterns appear to be the same as
missing values implemented with masks?  We said no, and Mark said yes.

To restate the argument in brief: Nathaniel and I and some others
thought that there were two separable ideas in play:

1) A value that is finally and completely missing. == ABSENT
2) A value that we would like to ignore for the moment but might want
back at some future time == IGNORED

(I'm using the adjectives ABSENT and IGNORED here as shorthand for the
objects 'absent value' and 'ignored value'.  This is to distinguish
them from the verbs below).

We thought bitpatterns were a good match for the former, and masking
was a good match for the latter.
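
The distinction can be sketched with tools that exist in released NumPy.
This is only an analogy: NaN stands in for a bitpattern, and np.ma stands
in for masking; neither is the NA implementation under discussion.

```python
import numpy as np

data = np.array([99.0, 2.0])

# ABSENT: overwrite the value with a bit pattern (NaN as a stand-in).
# The original 99 is gone for good.
absent = data.copy()
absent[0] = np.nan

# IGNORED: leave the value in place and hide it behind a boolean mask.
ignored = np.ma.masked_array(data.copy(), mask=[True, False])
hidden_value = ignored.data[0]   # the 99 is still stored underneath

# Clearing the mask brings the value back into play.
ignored.mask = False
total_after_unmask = ignored.sum()
```

Nothing comparable can recover the 99 from `absent`, which is the sense in
which a bitpattern value is "finally and completely missing".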

We all agreed there were two things you might like to do with values
that were missing in both senses above:

A) PROPAGATE; V + 1 == V
B) SKIP; K + 1 == 1

(Note verbs for the behaviors).
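
As a rough illustration of the two verbs, here NaN in a float array plays
the missing value (an assumption made for the sketch, not the NA
mechanism itself); np.sum propagates it and np.nansum skips it:

```python
import numpy as np

values = np.array([np.nan, 1.0])

# PROPAGATE: the missing element makes the whole result missing.
propagated = np.sum(values)

# SKIP: the missing element is left out of the reduction.
skipped = np.nansum(values)
```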

I believe the original np.ma masked arrays always SKIP.

In [2]: a = np.ma.masked_array?
In [3]: a = np.ma.masked_array([99, 2], mask=[True, False])
In [4]: a
masked_array(data = [-- 2],
             mask = [ True False],
       fill_value = 999999)
In [5]: a.sum()
Out[5]: 2

There was some discussion as to whether there was a reason to think
that ABSENT should always or by default PROPAGATE, and IGNORED should
always or by default SKIP.  Chuck was referring to this idea when he
said further up this thread:

> For instance, I'm thinking skipna=1 is the natural default for the masked arrays.

The current implementation

What we have now is an implementation of masked arrays, but more
tightly integrated into the numpy core.  In our language we have an
implementation of IGNORED that is tuned to be nearly indistinguishable
from the behavior we are expecting of ABSENT.

Specifically, once you have done this:

In [9]: a = np.array([99, 2], maskna=True)

you can get something representing the mask:

In [11]: np.isna(a)
Out[11]: array([False, False], dtype=bool)

but I believe there is no way of setting the mask directly.  In order
to set the mask, you have to do what looks like an assignment:

In [12]: a[0] = np.NA
In [14]: a
Out[14]: array([NA, 2])

In fact, what has happened is the mask has changed, but the underlying
value has not:

In [18]: orig = np.array([99, 2])

In [19]: a = orig.view(maskna=True)

In [20]: a[0] = np.NA

In [21]: a
Out[21]: array([NA, 2])

In [22]: orig
Out[22]: array([99,  2])

This is different from real assignment:

In [23]: a[0] = 0

In [24]: a
Out[24]: array([0, 2], maskna=True)

In [25]: orig
Out[25]: array([0, 2])
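
The maskna keyword used above may not exist in the NumPy you have
installed.  The same contrast between mask-setting and real assignment
can be reproduced with np.ma, used here purely as a stand-in:

```python
import numpy as np

orig = np.array([99, 2])
# copy=False is the default, so `a` shares its data buffer with `orig`.
a = np.ma.masked_array(orig)

# Looks like an assignment, but only the mask changes.
a[0] = np.ma.masked
after_mask = orig.copy()      # orig is untouched at this point

# A real assignment writes the data and clears the mask for that element.
a[0] = 0
after_assign = orig.copy()    # orig now sees the 0
```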

Some effort has gone into making it difficult to pull off the mask:

In [30]: a.view(np.int64)
Out[30]: array([NA, 2])

In [31]: a.view(np.int64).flags
  OWNDATA : False
  MASKNA : True
  ALIGNED : True

In [32]: a.astype(np.int64)
ValueError                                Traceback (most recent call last)
/home/mb312/<ipython-input-32-e7f3381c9692> in <module>()
----> 1 a.astype(np.int64)

ValueError: Cannot assign NA to an array which does not support NAs

The default behavior of the masked values is PROPAGATE, but they can
be individually made to SKIP:

In [28]: a.sum() # PROPAGATE
Out[28]: NA(dtype='int64')

In [29]: a.sum(skipna=True) # SKIP
Out[29]: 2

Where's the beef?

I personally still think that it is confusing to fuse the concept of:

1) Masked arrays
2) Arrays with bitpattern codes for missing

and the concepts of ABSENT and IGNORED values.

Consequences for current code

Specifically, it still seems to me to make sense to prefer this:

>> a = np.array([99, 2], masking=True)
>> a.mask
[ True, True ]
>> a.sum()
>> a.mask[0] = False
>> a.sum()

It might make sense, as Chuck suggests, to change the default to
'skipna=True', and I'd further suggest renaming np.NA to np.IGNORED
and 'skipna' to 'skipignored' for clarity.

I still think the pseudo-assignment:

In [20]: a[0] = np.NA

is confusing, and should be removed.

Later, should we ever have bitpatterns, there would be something like
np.ABSENT.  This of course would make sense for assignment:

In [20]: a[0] = np.ABSENT

There would be another keyword argument 'skipabsent=False' such that,
when this is False, the ABSENT values propagate.
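
A minimal sketch of how the two keywords might combine.  The function
`total` is hypothetical and not part of NumPy; NaN stands in for ABSENT,
a separate boolean array marks IGNORED, and NaN is also used as the
"propagated" result:

```python
import numpy as np

def total(values, ignored, skipignored=True, skipabsent=False):
    """Sum `values`, treating masked and NaN elements separately.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(values, dtype=float)
    ignored = np.asarray(ignored, dtype=bool)
    if ignored.any() and not skipignored:
        return np.nan                  # an IGNORED element propagates
    keep = ~ignored
    absent = np.isnan(values)
    if absent[keep].any() and not skipabsent:
        return np.nan                  # an ABSENT element propagates
    return float(values[keep & ~absent].sum())

values = [np.nan, 99.0, 2.0]           # first element ABSENT
ignored = [False, True, False]         # second element IGNORED

default_result = total(values, ignored)                 # ABSENT propagates
skip_both = total(values, ignored, skipabsent=True)     # only the 2.0 is summed
```

With these defaults, IGNORED values are skipped and ABSENT values
propagate, and each behavior can be switched independently.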

Honestly, I think that NA should be a synonym for ABSENT, and so
should be removed until the dust has settled, and then restored as
(np.NA == np.ABSENT).

And I think these two ideas, masking / IGNORED and bitpattern /
ABSENT, would be much easier to explain.

That's my best shot.

