[Numpy-discussion] missing data discussion round 2
Matthew Brett
matthew.brett@gmail....
Tue Jun 28 18:00:04 CDT 2011
Hi,
On Tue, Jun 28, 2011 at 11:40 PM, Jason Grout
<jason-sage@creativetrax.com> wrote:
> On 6/28/11 5:20 PM, Matthew Brett wrote:
>> Hi,
>>
>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith<njs@pobox.com> wrote:
>> ...
>>> (You might think, what difference does it make if you *can* unmask an
>>> item? Us missing data folks could just ignore this feature. But:
>>> whatever we end up implementing is something that I will have to
>>> explain over and over to different people, most of them not
>>> particularly sophisticated programmers. And there's just no sensible
>>> way to explain this idea that if you store some particular value, then
>>> it replaces the old value, but if you store NA, then the old value is
>>> still there.
>>
>> Ouch - yes. No question, that is difficult to explain. Well, I
>> think the explanation might go like this:
>>
>> "Ah, yes, well, that's because in fact numpy records missing values by
>> using a 'mask'. So when you say 'a[3] = np.NA', what you mean is,
>> 'a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False'"
>>
>> Is that fair?
>
> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
> the idea that the entry is still there, but we're just ignoring it. Of
> course, that goes against common convention, but it might be easier to
> explain.
I think Nathaniel's point is that np.IGNORE is a different idea than
np.NA, and that is why joining the implementations can lead to
conceptual confusion. For example, for:
    a = np.array([np.NA, 1])
you might expect the result of a.sum() to be np.NA. That's what it is
in R. However for:
    b = np.array([np.IGNORE, 1])
you'd probably expect b.sum() to be 1. That's what it is for
masked_array currently.
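
To make the two expectations concrete, here is a minimal sketch using only what released numpy offers. It assumes NaN as a stand-in for NA (np.NA and np.IGNORE are names from this discussion, not an existing API) and uses numpy.ma for the IGNORE behaviour:

```python
import numpy as np

# NA-like semantics: NaN stands in for "missing" here (an assumption;
# np.NA is not part of released numpy). The unknown value propagates.
a = np.array([np.nan, 1.0])
na_style_sum = a.sum()

# IGNORE-like semantics: masked entries are skipped by reductions.
# The 0.0 is just a placeholder sitting underneath the mask.
b = np.ma.masked_array([0.0, 1.0], mask=[True, False])
ignore_style_sum = b.sum()

print(na_style_sum)      # nan
print(ignore_style_sum)  # 1.0
```

Note that numpy.ma uses True to mean "masked", which is the opposite of the mask convention in the proposal.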
The current proposal fuses these two ideas with one implementation.
Quoting from the NEP:
    >>> a = np.array([1., 3., np.NA, 7.], masked=True)
    >>> np.sum(a)
    array(NA, dtype='<f8', masked=True)
    >>> np.sum(a, skipna=True)
    11.0
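
As a rough illustration of what "one implementation for both ideas" means, the behaviour quoted above could be mimicked like this. The function name masked_sum, the separate exposed array, and the use of None as an NA stand-in are all hypothetical; this is not how the NEP implements it:

```python
import numpy as np

def masked_sum(values, exposed, skipna=False):
    """Sum `values`; entries where `exposed` is False count as NA.

    Returns None as a hypothetical stand-in for NA when any entry is
    hidden and skipna is False.
    """
    values = np.asarray(values, dtype=float)
    exposed = np.asarray(exposed, dtype=bool)
    if skipna:
        # IGNORE behaviour: add up only the exposed entries.
        return float(values[exposed].sum())
    if not exposed.all():
        # NA behaviour: one unknown entry makes the total unknown.
        return None
    return float(values.sum())

values = [1.0, 3.0, 0.0, 7.0]          # 0.0 sits under the mask
exposed = [True, True, False, True]    # False marks the NA entry

print(masked_sum(values, exposed))               # None
print(masked_sum(values, exposed, skipna=True))  # 11.0
```

The same mask serves both readings; only the skipna flag decides which one the caller gets.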
I agree with Nathaniel that there is no practical way of avoiding the
full 'NAs are in fact values where there's a False in the mask'
concept, and that this imposes a serious conceptual cost on the 'NA'
user.
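
The existing numpy.ma module already shows the behaviour that is hard to explain: masking an element leaves the old value in place, and clearing the mask brings it back. A small sketch (again, numpy.ma uses True for "masked", the reverse of the proposal's convention):

```python
import numpy as np

a = np.ma.array([1.0, 3.0, 5.0, 7.0])

# "Store" a missing value at index 2. Only the mask changes.
a[2] = np.ma.masked
hidden = a.data[2]        # the old value is still in the data buffer
masked_total = a.sum()    # the masked entry is skipped

# Clearing the mask exposes the old value again.
a.mask = np.ma.nomask
restored_total = a.sum()

print(hidden)          # 5.0
print(masked_total)    # 11.0
print(restored_total)  # 16.0
```

Storing an ordinary value at index 2 would have overwritten the 5.0; storing the masked marker did not, which is exactly the asymmetry Nathaniel describes.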
Best,
Matthew