[Numpy-discussion] Concepts for masked/missing data

Wes McKinney wesmckinn@gmail....
Sat Jun 25 12:17:48 CDT 2011


On Sat, Jun 25, 2011 at 1:05 PM, Nathaniel Smith <njs@pobox.com> wrote:
> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
>> So far I see the difference between 1) and 2) being that you cannot
>> unmask.  So, if you didn't even know you could unmask data, then it
>> would not matter that 1) was being implemented by masks?
>
> I guess that is a difference, but I'm trying to get at something more
> fundamental -- not just what operations are allowed, but what
> operations people *expect* to be allowed. It seems like some of us
> have been talking past each other a lot, where someone says "but
> changing masks is the single most important feature!" and then someone
> else says "what are you talking about that doesn't even make sense".
>
>> To clarify, you're proposing for:
>>
>> a = np.sum(np.array([np.NA, np.NA]))
>>
>> 1) -> np.NA
>> 2) -> 0.0
>
> Yes -- and in R you actually do get NA, while in numpy.ma you
> actually do get 0. I don't think this is a coincidence; I think it's
> because they're designed as coherent systems that are trying to solve
> different problems. (Well, numpy.ma's "hardmask" idea seems inspired
> by the missing-data concept rather than the temporary-mask concept,
> but aside from that it seems pretty consistent in implementing option
> 2.)
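
The two behaviours can be roughly illustrated with released NumPy by
letting NaN stand in for the proposed np.NA (an assumption made only for
this sketch; np.NA is not an existing NumPy object, and NaN is not a
full substitute for it). np.sum propagates the missing value as in
option 1, while np.nansum skips it as in option 2:

```python
import numpy as np

# NaN is used here as a stand-in for the proposed np.NA.
values = np.array([np.nan, np.nan])

# Option 1: missingness propagates, so the sum of all-missing data is missing.
propagated = np.sum(values)

# Option 2: missing entries are skipped, so the sum of "nothing" is 0.0
# (np.nansum returns zero for all-NaN input in NumPy 1.9 and later).
skipped = np.nansum(values)

print(propagated, skipped)
```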

Agree. My basic observation about numpy.ma is that it's a finely
crafted solution for a different set of problems than the ones I have.
I just don't want the same thing to happen here, leaving me stuck writing
code (like I am now) that looks like:

mask = np.ma.getmaskarray(y)   # True where a value is masked
the_sum = y.sum(axis)
the_count = (~mask).sum(axis)  # number of unmasked values per slice
the_sum[the_count == 0] = np.nan
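
A self-contained version of that workaround might look like the sketch
below. The function name masked_sum_or_nan and the sample array are
hypothetical, chosen only for illustration; the sketch assumes y is a
floating-point numpy.ma.MaskedArray whose mask follows the numpy.ma
convention (True means masked).

```python
import numpy as np

def masked_sum_or_nan(y, axis):
    """Sum unmasked values along `axis`; NaN where every value is masked."""
    mask = np.ma.getmaskarray(y)        # full boolean array, True = masked
    the_sum = y.filled(0.0).sum(axis)   # masked entries contribute zero
    the_count = (~mask).sum(axis)       # unmasked values per slice
    return np.where(the_count == 0, np.nan, the_sum)

# Column 0 has two valid values; column 1 is entirely masked.
y = np.ma.array([[1.0, 2.0], [3.0, 4.0]],
                mask=[[False, True], [False, True]])
result = masked_sum_or_nan(y, axis=0)
print(result)
```

Counting the unmasked entries separately is what distinguishes "the sum
of no valid values" from a genuine sum of zero.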

> Here's another possible difference -- in (1), intuitively, missingness
> is a property of the data, so the logical place to put information
> about whether you can expect missing values is in the dtype, and to
> enable missing values you need to make a new array with a new dtype.
> (If we use a mask-based implementation, then
> np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
> to skip making a copy of the data -- I'm talking ONLY about the
> interface here, not whether missing data has a different storage
> format from non-missing data.)
>
> In (2), the whole point is to use different masks with the same data,
> so I'd argue masking should be a property of the array object rather
> than the dtype, and the interface should logically allow masks to be
> created, modified, and destroyed in place.
>
> They're both internally consistent, but I think we might have to make
> a decision and stick to it.
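
For reference, the option (2) style of interface is roughly what
numpy.ma already offers: several masks over one data buffer, each
changeable in place. A minimal sketch, with illustrative variable names
and values that are not taken from the thread:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Two masked arrays over the same values, each with its own mask
# (np.ma.array is documented to default to copy=False).
a = np.ma.array(data, mask=[True, False, False, False])
b = np.ma.array(data, mask=[False, False, False, True])

sum_a = a.sum()   # ignores data[0]
sum_b = b.sum()   # ignores data[3]

# Masks are created, modified, and destroyed in place; the underlying
# values are never overwritten by masking.
a[1] = np.ma.masked      # mask one more element
sum_a_more = a.sum()
a.mask = np.ma.nomask    # remove the mask again
sum_a_unmasked = a.sum()

print(sum_a, sum_b, sum_a_more, sum_a_unmasked)
```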
>
>> I agree it's good to separate the API from the implementation.   I
>> think the implementation is also important because I care about memory
>> and possibly speed.  But, that is a separate problem from the API...
>
> Yes, absolutely memory and speed are important. But a really fast
> solution to the wrong problem isn't so useful either :-).
>
> -- Nathaniel
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>