[Numpy-discussion] use for missing (ignored) data?

Benjamin Root ben.root@ou....
Wed Mar 7 20:13:18 CST 2012


On Wednesday, March 7, 2012, Nathaniel Smith <njs@pobox.com> wrote:
> On Wed, Mar 7, 2012 at 8:05 PM, Neal Becker <ndbecker2@gmail.com> wrote:
>> I'm wondering what is the use for the ignored data feature?
>>
>> I can use:
>>
>> A[valid_A_indexes] = whatever
>>
>> to process only the 'non-ignored' portions of A.  So at least some
>> simple cases of ignored data are already supported without introducing
>> a new type.
>>
>> OTOH:
>>
>> w = A[valid_A_indexes]
>>
>> will copy A's data, and subsequent use of
>>
>> w[:] = something
>>
>> will not update A.
>>
>> Is this the reason for wanting the ignored data feature?
>
> Hi Neal,
>
> There are a few reasons that I know of why people want more support
> from numpy for ignored data/masks, specifically (as opposed to missing
> data or other related concepts):
>
> 1) If you're often working on some subset of your data, then it's
> convenient to set the mask once and have it stay in effect for further
> operations. Anything you can accomplish this way can also be
> accomplished by keeping an explicit mask array and using it for
> indexing "by hand", but in some situations it may be more convenient
> not to.
>
> 2) Operating on subsets of an array without making a copy. Like
> Benjamin pointed out, indexing with a mask makes a copy. This is slow,
> and what's worse, people who work with large data sets (e.g., big fMRI
> volumes) may not have enough memory to afford such a copy. This
> problem can be solved by using the new where= argument to ufuncs
> (which skips the copy). (But then see (1) -- passing where= to a bunch
> of functions takes more typing than just setting it once and leaving
> it.)
>
> 3) Suppose there's a 3rd-party function that takes an array --
> borrowing Charles' example, say it's draw_points(arr). Now you want to
> apply it to just a subset of your data, and want to avoid a copy. It
> would be nice if the original author had made it draw_points(arr,
> mask), but they didn't. Well, if you have masking "built in" to your
> array type, then maybe you can call this as draw_points(masked_arr)
> and it will Just Work. I.e., maybe people who aren't thinking about
> masking will sometimes write code that accidentally works with masking
> anyway. I'm not sure how much I'd trust this, but I guess it's nice
> when it happens. And if it does work, then implementing the show/hide
> point functionality will be easier. (And if it doesn't work, and
> masking is built into numpy.ndarray, then maybe you can use this to
> argue with the original author that this is a bug, not just a missing
> feature. Again, I'm not sure if this is a good thing on net: one could
> argue that people shouldn't be forced to think about masking every
> time they write any function, just in case it becomes relevant later.
> But certainly it'd be useful sometimes.)
>
> There may be other motivations that I'm not aware of, of course.
>
> -- Nathaniel
>
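
For anyone following along, the copy-versus-in-place behaviour behind
points (1) and (2) above can be sketched roughly as follows. The arrays
A and B and the mask `valid` are made up for illustration; only the
boolean-index and where= behaviour is the point.

```python
import numpy as np

# Hypothetical data and mask, for illustration only.
A = np.arange(6, dtype=float)
valid = np.array([True, False, True, True, False, True])

# Assigning through a boolean index writes into A in place.
A[valid] = -1.0

# Reading through a boolean index returns a copy, so later writes
# to w never reach A.
w = A[valid]
w[:] = 99.0
assert not np.shares_memory(w, A)

# where= restricts a ufunc to the selected elements without making
# that copy; with out=B, the unselected elements of B are untouched.
B = np.arange(6, dtype=float)
np.multiply(B, 10.0, out=B, where=valid)
```

Note that where= still has to be passed on every call, which is exactly
the extra typing mentioned in point (2).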

I think you got most of the motivations right. I would say on the last
point that extension authors should be able to say "does not support NA!".
The important thing is that this makes the limitation explicit up front.

An additional motivation concerns mathematical operations. Personally, I
hate getting bitten by a function that takes a max() when I have a NaN in
the array.  In addition, what about adding two arrays together that may
or may not have different masks?  This has been the major advantage of
np.ma.  All of it Mostly Works.
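
A minimal sketch of both cases with np.ma, in case it helps. The sample
values are invented; the calls (np.ma.masked_invalid, np.ma.array, and
arithmetic on masked arrays) are the existing numpy.ma API.

```python
import numpy as np

# Hypothetical data containing a NaN.
a = np.array([1.0, np.nan, 3.0])

# A plain max() propagates the NaN.
plain_max = np.max(a)

# Masking the invalid entries first lets max() skip them.
masked = np.ma.masked_invalid(a)
masked_max = masked.max()

# Two arrays with different masks: the result of the addition is
# masked wherever either operand is masked.
x = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
y = np.ma.array([10.0, 20.0, 30.0], mask=[False, False, True])
total = x + y
result_mask = np.ma.getmaskarray(total)
```

Here only the first element of total is unmasked, since each of the
other two positions is masked in one of the operands.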

Cheers!
Ben Root