[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Benjamin Root ben.root@ou....
Fri Jun 24 20:25:47 CDT 2011

On Fri, Jun 24, 2011 at 8:00 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:

> On Fri, Jun 24, 2011 at 6:22 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>  On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris
>> <charlesr.harris@gmail.com> wrote:
>> >
>> >
>> > On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett <matthew.brett@gmail.com>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root <ben.root@ou.edu>
>> wrote:
>> >> ...
>> >> > Again, there are pros and cons either way, and I see them as very
>> >> > orthogonal and complementary.
>> >>
>> >> That may be true, but I imagine only one of them will be implemented.
>> >>
>> >> @Mark - I don't have a clear idea whether you consider the nafloat64
>> >> option to be still in play as the first thing to be implemented
>> >> (before array.mask).   If it is, what kind of thing would persuade you
>> >> either way?
>> >>
>> >
>> > Mark can speak for himself, but I think things are tending towards
>> > masks. They have the advantage of one implementation for all data
>> > types, current and future, and they are more flexible, since the
>> > masked data can be actual valid data that you just choose to ignore
>> > for experimental reasons.
>> >
>> > What might be helpful is a routine to import/export R files, but that
>> > shouldn't be too difficult to implement.
>> >
>> > Chuck
>> >
>> >
>> Perhaps we should make a wiki page someplace summarizing pros and cons
>> of the various implementation approaches? I worry very seriously about
>> adding API functions relating to masks rather than having special NA
>> values which propagate in algorithms. The question is: will Joe Blow,
>> former R user, have to understand what the mask is and how to work with
>> it? If the answer is yes, we have a problem. If it can be completely
>> hidden as an implementation detail, that's great. In R, NAs are just
>> sort of inherent -- they propagate, and you deal with them when you have
>> to via the na.rm flag in functions or is.na.
> I think the interface for how it looks in NumPy can be made to be pretty
> close to the same with either design approach. I've updated the NEP to add
> and emphasize using masked values with an np.NA singleton, with the
> validitymask as the implementation mechanism, which is still accessible for
> those who still want to deal with the mask directly.

I think there are a lot of benefits to this idea, if I understand it
correctly.  Essentially, if I were to assign np.NA to an element (or a
slice) of a numpy array, rather than actually assigning that value to that
spot in the array, it would set the mask to True for those elements?
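
A quick sketch of those semantics, using today's numpy.ma as a stand-in
(np.NA is only proposed in the NEP, so the existing np.ma.masked singleton
plays its role here):

```python
import numpy as np

# Assigning the masked singleton flips the mask instead of writing a value.
a = np.ma.array([1.0, 2.0, 3.0], mask=[False, False, False])
a[1] = np.ma.masked          # marks element 1 as missing
print(a.mask.tolist())       # [False, True, False]
print(a.sum())               # 4.0 -- reductions skip masked elements
```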

Imagine in pylab mode (from pylab import *): users would have the NA name
right in their namespace, just like they are used to with R.  And those who
want to could still work with masks directly, much as we do now.  Plus, I
think we can still retain the good ol' C pointer to the regular data.  My
question is this: will it be a soft or hard mask?  In other words, if I were
to assign np.NA to a spot in an array, would it destroy the value that was
there (if it had already been initialized)?  Would I still be able to share
masks?
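
For reference, numpy.ma already distinguishes the two behaviors being asked
about here; a rough sketch of the difference:

```python
import numpy as np

# Soft mask (the default): assigning a value to a masked slot
# clears the mask and overwrites the data underneath.
soft = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
soft[1] = 99.0
print(soft[1])                   # 99.0

# Hard mask: writes to masked slots are silently ignored.
hard = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False],
                   hard_mask=True)
hard[1] = 99.0
print(hard[1] is np.ma.masked)   # True
```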

Admittedly, it is a merging of two distinct ideas, but I think the end
result would still be the same.

>> The other problem I can think of with masks is the extra memory
>> footprint, though maybe this is no cause for concern.
> The overhead is definitely worth considering, along with the extra memory
> traffic it generates, and I've basically concluded that the increased
> generality and flexibility is worth the added cost.
If we go with the mask approach, one could later try to optimize the
implementation of the masks to reduce the memory footprint.  Potentially,
one could reduce the footprint by 7/8ths by packing the mask into bits!
Maybe some sneaky striding tricks could help avoid too many cache misses
(or we could go with the approach Pierre mentioned, which was to calculate
them all and let the masks sort them out).
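
The 7/8ths figure assumes packing the one-byte-per-element boolean mask into
a bitmask; np.packbits illustrates the saving (just an illustration, not the
NEP's actual layout):

```python
import numpy as np

# A boolean mask costs one byte per element; packed, one bit per element.
mask = np.array([True, False, True, True, False, False, True, False] * 1000)
packed = np.packbits(mask)
print(mask.nbytes)    # 8000
print(packed.nbytes)  # 1000
```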

As a complete side-thought, I wonder how sparse arrays could play into this.

Ben Root
