[Numpy-discussion] NA masks in the next numpy release?
Travis Oliphant
oliphant@enthought....
Thu Oct 27 20:16:24 CDT 2011
That is a pretty good explanation. I find myself convinced by Matthew's arguments. I think that being able to separate ABSENT from IGNORED is a good idea. I also like being able to control SKIP and PROPAGATE (but I think the current implementation allows this already).
What is the counter-argument to this proposal?
-Travis
On Oct 27, 2011, at 7:31 PM, Matthew Brett wrote:
> Hi,
>
> On Tue, Oct 25, 2011 at 7:56 PM, Travis Oliphant <oliphant@enthought.com> wrote:
>> So, I am very interested in making sure I remember the details of the counterproposal. What I recall is that you wanted to be able to differentiate between a "bit-pattern" mask and a boolean-array mask in the API. I believe currently even when bit-pattern masks are implemented the difference will be "hidden" from the user on the Python level.
>>
>> I am sure to be missing other parts of the discussion as I have been in and out of it.
>
> The ideas
> --------------
>
> The question that we were addressing in the alter-NEP was: should
> missing values implemented as bitpatterns appear to be the same as
> missing values implemented with masks? We said no, and Mark said yes.
>
> To restate the argument in brief: Nathaniel and I and some others
> thought that there were two separable ideas in play:
>
> 1) A value that is finally and completely missing. == ABSENT
> 2) A value that we would like to ignore for the moment but might want
> back at some future time == IGNORED
>
> (I'm using the adjectives ABSENT and IGNORED here to be short for the
> objects 'absent value' and 'ignored value'. This is to distinguish
> from the verbs below).
>
> We thought bitpatterns were a good match for the former, and masking
> was a good match for the latter.
>
> We all agreed there were two things you might like to do with values
> that were missing in both senses above:
>
> A) PROPAGATE; V + 1 == V
> B) SKIP; K + 1 == 1
>
> (Note verbs for the behaviors).
>
> I believe the original np.ma masked arrays always SKIP.
>
> In [2]: a = np.ma.masked_array?
> In [3]: a = np.ma.masked_array([99, 2], mask=[True, False])
> In [4]: a
> Out[4]:
> masked_array(data = [-- 2],
> mask = [ True False],
> fill_value = 999999)
> In [5]: a.sum()
> Out[5]: 2
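
For readers who want to try the PROPAGATE / SKIP distinction with released NumPy, here is a minimal sketch. It is an illustration only: it uses np.nan as a stand-in for a missing value, which is an assumption of this example and not the maskna API discussed in this thread.

```python
import numpy as np

values = np.array([99.0, 2.0])

# PROPAGATE: a NaN standing in for a missing value poisons the reduction.
with_nan = values.copy()
with_nan[0] = np.nan
propagated = with_nan.sum()

# SKIP, variant 1: np.nansum leaves NaN entries out of the sum.
skipped_nan = np.nansum(with_nan)

# SKIP, variant 2: np.ma reductions leave masked entries out of the sum.
masked = np.ma.masked_array(values, mask=[True, False])
skipped_masked = masked.sum()

print(propagated, skipped_nan, skipped_masked)
```

The first result is NaN (the missing value propagated), while the other two are 2.0 (the missing value was skipped).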
>
> There was some discussion as to whether there was a reason to think
> that ABSENT should always or by default PROPAGATE, and IGNORED should
> always or by default SKIP. Chuck is referring to this idea when he
> said further up this thread:
>
>> For instance, I'm thinking skipna=1 is the natural default for the masked arrays.
>
> The current implementation
> ---------------------------------------
>
> What we have now is an implementation of masked arrays, but more
> tightly integrated into the numpy core. In our language we have an
> implementation of IGNORED that is tuned to be nearly indistinguishable
> from the behavior we are expecting of ABSENT.
>
> Specifically, once you have done this:
>
> In [9]: a = np.array([99, 2], maskna=True)
>
> you can get something representing the mask:
>
> In [11]: np.isna(a)
> Out[11]: array([False, False], dtype=bool)
>
> but I believe there is no way of setting the mask directly. In order
> to set the mask, you have to do what looks like an assignment:
>
> In [12]: a[0] = np.NA
> In [14]: a
> Out[14]: array([NA, 2])
>
> In fact, what has happened is the mask has changed, but the underlying
> value has not:
>
> In [18]: orig = np.array([99, 2])
>
> In [19]: a = orig.view(maskna=True)
>
> In [20]: a[0] = np.NA
>
> In [21]: a
> Out[21]: array([NA, 2])
>
> In [22]: orig
> Out[22]: array([99, 2])
>
> This is different from real assignment:
>
> In [23]: a[0] = 0
>
> In [24]: a
> Out[24]: array([0, 2], maskna=True)
>
> In [25]: orig
> Out[25]: array([0, 2])
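
The difference between changing the mask and changing the data can also be sketched with np.ma, which is available in released NumPy. This is only an analogy to the maskna session above, assuming that np.ma.masked plays the role of np.NA; the variable names are illustrative.

```python
import numpy as np

orig = np.array([99, 2])
a = np.ma.masked_array(orig)   # no copy by default: a shares orig's data buffer

a[0] = np.ma.masked            # looks like assignment, but only the mask changes
after_masking = orig.copy()    # snapshot of the underlying data

a[0] = 0                       # real assignment: writes the data and unmasks
after_assign = orig.copy()

print(after_masking, after_assign)
```

After the first step the underlying array still holds 99; only the second step overwrites it.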
>
> Some effort has gone into making it difficult to pull off the mask:
>
> In [30]: a.view(np.int64)
> Out[30]: array([NA, 2])
>
> In [31]: a.view(np.int64).flags
> Out[31]:
> C_CONTIGUOUS : True
> F_CONTIGUOUS : True
> OWNDATA : False
> MASKNA : True
> OWNMASKNA : False
> WRITEABLE : True
> ALIGNED : True
> UPDATEIFCOPY : False
>
> In [32]: a.astype(np.int64)
> ---------------------------------------------------------------------------
> ValueError Traceback (most recent call last)
> /home/mb312/<ipython-input-32-e7f3381c9692> in <module>()
> ----> 1 a.astype(np.int64)
>
> ValueError: Cannot assign NA to an array which does not support NAs
>
> The default behavior of the masked values is PROPAGATE, but they can
> be individually made to SKIP:
>
> In [28]: a.sum() # PROPAGATE
> Out[28]: NA(dtype='int64')
>
> In [29]: a.sum(skipna=True) # SKIP
> Out[29]: 2
>
> Where's the beef?
> -------------------------
>
> I personally still think that it is confusing to fuse the concept of:
>
> 1) Masked arrays
> 2) Arrays with bitpattern codes for missing
>
> and the concepts of
>
> A) ABSENT and
> B) IGNORED
>
> Consequences for current code
> --------------------------------------------
>
> Specifically, it still seems to me to make sense to prefer this:
>
> >>> a = np.array([99, 2], masking=True)
> >>> a.mask
> [ True, True ]
> >>> a.sum()
> 101
> >>> a.mask[0] = False
> >>> a.sum()
> 2
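
Something close to this can be tried today with np.ma, where the mask is directly settable. A rough sketch follows; note that np.ma uses the opposite polarity from the proposal quoted above (in np.ma, True means the value is hidden), and the 'masking=True' keyword in the proposal is hypothetical.

```python
import numpy as np

a = np.ma.masked_array([99, 2], mask=[False, False])
sum_all = a.sum()            # both values visible

a.mask = [True, False]       # IGNORE the first value by setting the mask directly
sum_ignored = a.sum()

a.mask = [False, False]      # bring the ignored value back
sum_restored = a.sum()

print(sum_all, sum_ignored, sum_restored)
```

The value 99 is never lost; it is only hidden while the mask is set, which is the IGNORED behavior described earlier.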
>
> It might make sense, as Chuck suggests, to change the default to
> 'skipna=True', and I'd further suggest renaming np.NA to np.IGNORED
> and 'skipna' to 'skipignored' for clarity.
>
> I still think the pseudo-assignment:
>
> In [20]: a[0] = np.NA
>
> is confusing, and should be removed.
>
> Later, should we ever have bitpatterns, there would be something like
> np.ABSENT. This of course would make sense for assignment:
>
> In [20]: a[0] = np.ABSENT
>
> There would be another keyword argument 'skipabsent=False' such that,
> when this is False, the ABSENT values propagate.
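
To make the two-keyword idea concrete, here is a purely hypothetical sketch. The function total() and its arguments are invented for illustration and are not part of NumPy; NaN stands in for an ABSENT bitpattern and a separate boolean array stands in for the IGNORED mask.

```python
import numpy as np

def total(data, ignored, skipignored=True, skipabsent=False):
    """Hypothetical reduction: NaN plays ABSENT, `ignored` plays the IGNORED mask."""
    data = np.asarray(data, dtype=float)
    ignored = np.asarray(ignored, dtype=bool)
    absent = np.isnan(data)

    # A missing value propagates when its own skip flag is switched off.
    if ignored.any() and not skipignored:
        return np.nan
    if (absent & ~ignored).any() and not skipabsent:
        return np.nan

    # Otherwise sum only the values that are neither ignored nor absent.
    keep = ~ignored & ~absent
    return data[keep].sum()

print(total([99, 2, np.nan], [True, False, False]))                   # ABSENT propagates
print(total([99, 2, np.nan], [True, False, False], skipabsent=True))  # both kinds skipped
```

With these defaults an IGNORED value is skipped and an ABSENT value propagates, matching the defaults suggested in the text.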
>
> Honestly, I think that NA should be a synonym for ABSENT, and so
> should be removed until the dust has settled, and then restored as
> (np.NA == np.ABSENT).
>
> And I think these two ideas, of masking / IGNORED and bitpattern /
> ABSENT, would be much easier to explain.
>
> That's my best shot.
>
> Matthew
---
Travis Oliphant
Enthought, Inc.
oliphant@enthought.com
1-512-536-1057
http://www.enthought.com