[Numpy-discussion] NA masks in the next numpy release?
Fri Oct 28 00:56:22 CDT 2011
On Thursday, October 27, 2011, Charles R Harris <email@example.com>
> On Thu, Oct 27, 2011 at 7:16 PM, Travis Oliphant <firstname.lastname@example.org>
>> That is a pretty good explanation. I find myself convinced by Matthew's
>> arguments. I think that being able to separate ABSENT from IGNORED is a
>> good idea. I also like being able to control SKIP and PROPAGATE (but I
>> think the current implementation allows this already).
>> What is the counter-argument to this proposal?
> What exactly do you find convincing? The current masks propagate by default:
> In : a = ones(5, maskna=1)
> In : a[2] = NA
> In : a
> Out: array([ 1., 1., NA, 1., 1.])
> In : a + 1
> Out: array([ 2., 2., NA, 2., 2.])
> In : a[2] = 10
> In : a
> Out: array([ 1., 1., 10., 1., 1.], maskna=True)
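For readers on a NumPy build without the `maskna` keyword, the session above can be approximated with `numpy.ma`. This is only a rough analogue, on the assumption that `np.ma.masked` stands in for `NA`; it is not the implementation being discussed in this thread.

```python
import numpy as np

a = np.ma.masked_array(np.ones(5))
a[2] = np.ma.masked          # rough analogue of `a[2] = NA`
after_masking = a.tolist()   # masked entries come back as None

b = a + 1                    # the masked slot stays masked in the result
after_add = b.tolist()

a[2] = 10                    # assigning a real value unmasks the slot
after_assign = a.tolist()

print(after_masking)
print(after_add)
print(after_assign)
```

As in the quoted session, the gap carries through `a + 1` and disappears once a value is assigned to that element.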
> I don't see an essential difference between the implementation using masks
> and one using bit patterns. The mask, when attached to the original array,
> just adds a bit pattern by extending all the types by one byte, an approach
> that easily extends to all existing and future types, which is why Mark went
> that way for the first implementation given the time available. The masks
> are hidden because folks wanted something that behaved more like R and also
> because of the desire to combine the missing, ignore, and later possibly bit
> patterns in a unified manner. Note that the pseudo assignment was also meant
> to look like R. Adding true bit patterns to numpy isn't trivial and I
> believe Mark was thinking of parametrized types for that.
> The main problems I see with masks are unified storage and possibly memory
> use. The rest is just behavior and desired API and that can be adjusted
> within the current implementation. There is nothing essentially masky about
I think Chuck sums it up quite nicely. The implementation detail about
using mask versus bit patterns can still be discussed and addressed.
Personally, I just don't see how parameterized dtypes would be easier to use
than the pseudo assignment.
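To make the mask versus bit pattern distinction concrete, here is a minimal sketch using tools that already exist. It assumes `numpy.ma` as the stand-in for the mask approach and NaN as the stand-in for a reserved bit pattern; neither is the scheme proposed in the NEPs, they are just the closest available illustrations.

```python
import numpy as np

# Mask approach: a separate boolean array travels alongside the data,
# so it works for integers, which have no spare NaN-like pattern.
ints = np.array([1, 2, 3, 4, 5], dtype=np.int32)
masked_ints = np.ma.masked_array(ints, mask=[False, False, True, False, False])
bytes_with_mask = masked_ints.dtype.itemsize + masked_ints.mask.dtype.itemsize

# Bit-pattern approach: a reserved value inside the data marks the gap,
# so no extra storage is needed, but the dtype must have a value to spare.
floats = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
bytes_with_pattern = floats.dtype.itemsize

# Both representations identify the same missing position.
same_gap = np.isnan(floats).tolist() == masked_ints.mask.tolist()

print(bytes_with_mask, bytes_with_pattern, same_gap)
```

The extra byte per element is the memory cost Chuck mentions; the payoff is that the same mechanism covers every dtype.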
The elegance of Mark's solution was to consider the treatment of missing
data in a unified manner. This puts missing data in a more prominent spot
for extension builders, which should greatly improve support throughout the
ecosystem. By letting there be a single missing data framework (instead of
two) all that users need to figure out is when they want nan-like behavior
(propagate) or to be more like masks (skip). Numpy takes care of the rest.
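The two behaviors can be seen side by side with what NumPy offers today. This sketch assumes NaN for the propagating case and `numpy.ma` for the skipping case; it only illustrates the semantics and says nothing about how the unified framework would expose them.

```python
import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0])

# NaN-like behavior: the missing value propagates through the reduction.
propagated = data.sum()

# Mask-like behavior: the missing value is skipped by the reduction.
skipped = np.ma.masked_invalid(data).sum()

print(propagated)  # nan
print(skipped)     # 7.0
```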
There is a reason why I like using masked arrays: I don't have to
use nansum in my library functions to guard against the possibility of
receiving nans. Duck-typing is a good thing.
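A small sketch of what I mean, where `center` is a hypothetical library function made up for this example. It calls only the array's own methods, so it never needs a NaN-aware variant when handed a masked array.

```python
import numpy as np

def center(values):
    # Hypothetical library function: relies only on the array's own methods.
    return values - values.mean()

raw = np.array([1.0, np.nan, 3.0])

# With a plain array the NaN poisons the mean, so every element becomes nan.
nan_result = center(raw)

# With a masked array the gap is skipped by mean() and stays masked.
masked_result = center(np.ma.masked_invalid(raw))

print(nan_result)
print(masked_result)
```

The function body is identical in both calls; only the type of the input decides how the gap is treated.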
My argument against separating IGNORE and PROPAGATE is that it becomes too
tempting to mix these in an array, where the desired behavior would
likely become ambiguous.
There is one other problem that I just thought of that I don't think has
been outlined in either NEP. What if I perform an operation between an
array set up with propagate NAs and an array with skip NAs?