[Numpy-discussion] missing data discussion round 2
Matthew Brett
matthew.brett@gmail....
Wed Jun 29 14:32:14 CDT 2011
Hi,
On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
> On Wed, Jun 29, 2011 at 8:20 AM, Lluís <xscript@gmx.net> wrote:
>>
>> Matthew Brett writes:
>>
>> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
>> >> the idea that the entry is still there, but we're just ignoring it. Of
>> >> course, that goes against common convention, but it might be easier to
>> >> explain.
>>
>> > I think Nathaniel's point is that np.IGNORE is a different idea than
>> > np.NA, and that is why joining the implementations can lead to
>> > conceptual confusion.
>>
>> This is how I see it:
>>
>> >>> a = np.array([0, 1, 2], dtype=int)
>> >>> a[0] = np.NA
>> ValueError
>> >>> e = np.array([np.NA, 1, 2], dtype=int)
>> ValueError
>> >>> b = np.array([np.NA, 1, 2], dtype=np.maybe(int))
>> >>> m = np.array([np.NA, 1, 2], dtype=int, masked=True)
>> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
>> >>> b[1] = np.NA
>> >>> np.sum(b)
>> np.NA
>> >>> np.sum(b, skipna=True)
>> 2
>> >>> b.mask
>> None
>> >>> m[1] = np.NA
>> >>> np.sum(m)
>> 2
>> >>> np.sum(m, skipna=True)
>> 2
>> >>> m.mask
>> [False, False, True]
>> >>> bm[1] = np.NA
>> >>> np.sum(bm)
>> 2
>> >>> np.sum(bm, skipna=True)
>> 2
>> >>> bm.mask
>> [False, False, True]
>>
>> So:
>>
>> * Mask takes precedence over bit pattern on element assignment. There's
>> still the question of how to assign a bit pattern NA when the mask is
>> active.
>>
>> * When using mask, elements are automagically skipped.
>>
>> * "m[1] = np.NA" is equivalent to "m.mask[1] = False"
>>
>> * When using bit pattern + mask, it might make sense to have the initial
>> values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True,
>> False, True]" and "np.sum(bm) == np.NA")
>
> There seems to be a general idea that masks and NA bit patterns imply
> particular differing semantics, something which I think is simply false.
Well - first - it's helpful surely to separate the concepts and the
implementation.

Concepts / use patterns (as delineated by Nathaniel):

A) missing values == 'np.NA' in my emails. Can we call that CMV
(concept missing values)?
B) masks == np.IGNORE in my emails. CMSK (concept masks)?

Implementations:

1) bit-pattern == na-dtype - how about we call that IBP
(implementation bit pattern)?
2) array.mask. IM (implementation mask)?
Nathaniel implied that:

CMV implies: sum([np.NA, 1]) == np.NA
CMSK implies: sum([np.NA, 1]) == 1

and indeed, that's how R and masked arrays respectively behave. So I
think it's reasonable to say that at least R thought that the bit
pattern implied the first, and Pierre and others thought the mask meant
the second.
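
As a rough illustration of those two default behaviors with what NumPy
already ships: the sketch below uses np.nan as a stand-in for the proposed
np.NA (an assumption, since np.NA is only a proposal here and NaN is not a
true NA) and numpy.ma for the mask concept. The variable names are just for
illustration.

```python
import numpy as np

# CMV-like default: the special value propagates through the reduction.
# np.nan is only a stand-in for the proposed np.NA.
values = np.array([np.nan, 1.0])
propagated = np.sum(values)        # result is NaN
skipped = np.nansum(values)        # explicit "skip" request, like skipna=True

# CMSK-like default: masked entries are silently ignored.
masked = np.ma.masked_array([0.0, 1.0], mask=[True, False])
masked_total = masked.sum()        # only the unmasked 1.0 contributes
```

The point is only the difference in defaults: one reduction has to be told
to skip, the other skips without being asked.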
The NEP as it stands thinks of CMV and CMSK as being different views
of the same thing. Please correct me if I'm wrong.
> Both NaN and Inf are implemented in hardware with the same idea as the NA
> bit pattern, but they do not follow NA missing value semantics.
Right - and that doesn't affect the argument, because the argument is
about the concepts and not the implementation.
> As far as I can tell, the only required difference between them is that NA
> bit patterns must destroy the data. Nothing else.
I think Nathaniel's point was about the expected default behavior in
the different concepts.
> Everything on top of that
> is a choice of API and interface mechanisms. I want them to behave exactly
> the same except for that necessary difference, so that it will be possible
> to use the *exact same Python code* with either approach.
Right. And Nathaniel's point is that that desire leads to fusion of
the two ideas into one when they should be separated. For example, if
I understand correctly:
>>> a = np.array([1.0, 2.0, 3.0, 7.0], masked=True)
>>> b = np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
>>> a[3] = np.NA # magic mask setting although it looks the same
>>> b[3] = np.NA # actual real hand-on-heart assignment
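
To make the difference concrete with today's NumPy, here is a minimal
sketch. It assumes numpy.ma as the mask implementation and np.nan as a
stand-in for a bit-pattern NA; neither is the API proposed in the NEP, and
the names are illustrative only.

```python
import numpy as np

# Mask implementation (IM): marking an element hides it but keeps the payload.
a = np.ma.masked_array([1.0, 2.0, 3.0, 7.0],
                       mask=[False, False, False, False])
a[3] = np.ma.masked
hidden_payload = a.data[3]   # the underlying 7.0 is still stored
a_total = a.sum()            # the masked element is skipped

# Bit-pattern stand-in (IBP): the special value overwrites the payload.
b = np.array([1.0, 2.0, 3.0, 7.0])
b[3] = np.nan                # the 7.0 is gone for good
b_total = b.sum()            # NaN propagates through the sum
```

The two assignments look alike on the page, but only the second one
actually changes the stored data.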
> Say you're using NA dtypes, and suddenly you think, "what if I temporarily
> treated these as NA too". Now you have to copy your whole array to avoid
> destroying your data! The NA bit pattern didn't save you memory here... Say
> you're using masks, and it turns out you didn't actually need masking
> semantics. If they're different, you now have to do lots of code changes to
> switch to NA dtypes!
I personally have not run across that case. I'd imagine that, if you
knew you wanted to do something so explicitly masking-like, you'd
start with the masking interface.
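
For what it's worth, the "temporarily treat these as NA too" case can be
sketched with existing tools. This assumes numpy.ma for the mask route and
np.nan as a stand-in for a bit-pattern NA, and it relies on masked_array
not copying an ndarray input by default; it is an illustration, not the
NEP's interface.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 7.0])

# Mask route: wrap the same buffer and ignore one more element for now.
temp_view = np.ma.masked_array(data, mask=[False, True, False, False])
view_total = temp_view.sum()                          # 1 + 3 + 7
shares_buffer = np.shares_memory(temp_view.data, data)

# Bit-pattern stand-in route: mark on a copy so the original survives.
temp_copy = data.copy()
temp_copy[1] = np.nan
copy_total = np.nansum(temp_copy)                     # 1 + 3 + 7
```

Both routes give the same total and leave data untouched; the mask route
just avoids the copy.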
Clearly there are some overlaps between what masked arrays are trying
to achieve and what R's NA mechanisms are trying to achieve. Are they
really similar enough that they should function using the same API?
And if so, won't that be confusing? I think that's the question
that's being asked.
See you,
Matthew