[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Wes McKinney wesmckinn@gmail....
Fri Jun 24 23:06:46 CDT 2011


On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith <njs@pobox.com> wrote:
> On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root <ben.root@ou.edu> wrote:
>> On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith <njs@pobox.com> wrote:
>>> This is a situation where I would just... use an array and a mask,
>>> rather than a masked array. Then lots of things -- changing fill
>>> values, temporarily masking/unmasking things, etc. -- come for free,
>>> just from knowing how arrays and boolean indexing work?
>>
>> With a masked array, it is "for free".  Why re-invent the wheel?  It has
>> already been done for me.
>
> But it's not for free at all. It's an additional concept that has to
> be maintained, documented, and learned (with the last cost, which is
> multiplied by the number of users, being by far the greatest). It's
> not reinventing the wheel, it's saying hey, I have wheels and axles,
> but what I really need the library to provide is a wheel+axle
> assembly!

You're communicating my argument better than I am.
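
For concreteness, here is a minimal sketch of the "array plus a boolean
mask" approach described above, next to the numpy.ma equivalent. The sample
data and the -999.0 sentinel are assumptions made up for illustration; they
are not from anyone's real code.

```python
import numpy as np

# Hypothetical sample data; -999.0 marks a bad reading in this sketch.
data = np.array([1.0, 2.0, -999.0, 4.0])
mask = data == -999.0              # True where the value is missing

# Aggregate over the valid entries only, using plain boolean indexing.
valid_mean = data[~mask].mean()

# "Changing the fill value" is just np.where with a different constant.
filled = np.where(mask, 0.0, data)

# Temporarily unmasking: edit a copy of the mask, leave the original alone.
tmp_mask = mask.copy()
tmp_mask[2] = False
n_visible = int((~tmp_mask).sum())

# The same aggregate via numpy.ma, for comparison.
ma_mean = np.ma.masked_array(data, mask=mask).mean()
```

Both routes give the same mean here. The difference being argued about is
how many concepts the user has to learn to get there.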

>>> Do we really get much advantage by building all these complex
>>> operations in? I worry that we're trying to anticipate and write code
>>> for every situation that users find themselves in, instead of just
>>> giving them some simple, orthogonal tools.
>>>
>>
>> This is the danger, which is why I advocate retaining the MaskedArray
>> type that would provide the high-level "intelligent" operations, meanwhile
>> having in the core the basic data structures for pairing a mask with an
>> array, and to recognize a special np.NA value that would act upon the mask
>> rather than the underlying data.  Users would get very basic functionality,
>> while the MaskedArray would continue to provide the interface that we are
>> used to.
>
> The interface as described is quite different... in particular, all
> aggregate operations would change their behavior.
>
>>> As a corollary, I worry that learning and keeping track of how masked
>>> arrays work is more hassle than just ignoring them and writing the
>>> necessary code by hand as needed. Certainly I can imagine that *if the
>>> mask is a property of the data* then it's useful to have tools to keep
>>> it aligned with the data through indexing and such. But some of these
>>> other things are quicker to reimplement than to look up the docs for,
>>> and the reimplementation is easier to read, at least for me...
>>
>> What you are advocating is similar to the "tried-n-true" coding practice of
>> Matlab users of using NaNs.  You will hear from Matlab programmers about how
>> it is the greatest idea since sliced bread (and I was one of them).  Then I
>> was introduced to NumPy, and while I do sometimes still use the NaN
>> approach, I realized that the masked array is a "better" way.
>
> Hey, no need to go around calling people Matlab programmers, you might
> hurt someone's feelings.
>
> But seriously, my argument is that every abstraction and new concept
> has a cost, and I'm dubious that the full masked array abstraction
> carries its weight and justifies this cost, because it's highly
> redundant with existing abstractions. That has nothing to do with how
> tried-and-true anything is.

+1. I think I will personally only be happy if "masked array" can be
implemented while incurring near-zero cost from the end user
perspective. If what we end up with is a faster implementation of
numpy.ma in C, I'm probably going to keep on using NaN... That's why
I'm entirely insistent that whatever design be dogfooded on non-expert
users. If it's much harder / trickier / more nuanced than R, you will
have failed.
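
To illustrate what "keep on using NaN" looks like in practice, here is a
rough sketch comparing the NaN idiom with numpy.ma on made-up data. It only
uses calls I am sure exist (np.isnan, np.nansum, np.ma.masked_invalid); the
values are an assumption for illustration.

```python
import numpy as np

# Hypothetical data with one missing observation encoded as NaN.
values = np.array([1.0, np.nan, 3.0])

# A plain reduction propagates the NaN.
plain_total = values.sum()

# NaN idiom: skip missing values explicitly.
nan_total = np.nansum(values)
nan_mean = values[~np.isnan(values)].mean()

# numpy.ma idiom: mask the NaNs once, then reduce as usual.
masked = np.ma.masked_invalid(values)
ma_total = masked.sum()
ma_mean = masked.mean()
```

The results agree; the question is which of the two a non-expert user can
pick up without reading much documentation.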

>> As for documentation, on hard/soft masks, just look at the docs for the
>> MaskedArray constructor:
> [...snipped...]
>
> Thanks!
>
> -- Nathaniel

