[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Mark Wiebe mwwiebe@gmail....
Thu Jun 23 17:24:40 CDT 2011

On Thu, Jun 23, 2011 at 5:05 PM, Keith Goodman <kwgoodman@gmail.com> wrote:

> On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
> > Enthought has asked me to look into the "missing data" problem and how
> > NumPy could treat it better. I've considered the different ideas of adding
> > dtype variants with a special signal value and masked arrays, and concluded
> > that adding masks to the core ndarray appears to be the best way to deal
> > with the problem in general.
> > I've written a NEP that proposes a particular design, viewable here:
> >
> > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst
> > There are some questions at the bottom of the NEP which definitely need
> > discussion to find the best design choices. Please read, and let me know
> > of all the errors and gaps you find in the document.
> Wow, that is exciting.
> I wonder about the relative performance of the two possible
> implementations (mask and NA) in the NEP.

I've given that some thought, and I don't think there's a clear way to tell
what the performance gap would be without implementations of both to
benchmark against each other. I favor the mask primarily because it provides
masking for all data types in one go with a single consistent interface to
program against. For adding NA signal values, each new data type would need
a lot of work to gain the same level of support.

> If you are, say, doing a calculation along the columns of a 2d array
> one element at a time, then you will need to grab an element from the
> array and grab the corresponding element from the mask. I assume the
> corresponding data and mask elements are not stored together. That
> would be slow since memory access is usually where time is spent. In
> this regard NA would be faster.

Yes, the masks add more memory traffic and some extra calculation, while the
NA signal values just require some additional calculations.
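
As a rough illustration of that difference (a plain-Python sketch, not code
from the NEP), the two loops below compute the same sum. The function names
are hypothetical, and the convention that True in the mask means "missing"
is an assumption borrowed from numpy.ma; the NEP may choose otherwise.

```python
nan = float("nan")

def sum_with_signal_value(values):
    """Skip missing entries marked in-band with NaN: one read per element."""
    total = 0.0
    for x in values:
        if x == x:  # False only when x is NaN
            total += x
    return total

def sum_with_mask(values, mask):
    """Skip missing entries marked in a separate mask: two reads per element."""
    total = 0.0
    for x, is_missing in zip(values, mask):
        if not is_missing:  # assumption: True means "missing"
            total += x
    return total

# The same data expressed both ways; the masked slot holds an ordinary value.
signal_total = sum_with_signal_value([1.0, nan, 3.0])
mask_total = sum_with_mask([1.0, 999.0, 3.0], [False, True, False])
```

The second loop touches a second sequence on every iteration, which is the
extra memory traffic; the first pays only for the extra comparison.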

> I currently use NaN as a missing data marker. That adds things like
> this to my cython code:
>    if a[i] == a[i]:
>        asum += a[i]
> If NA also had the property NA == NA is False, then it would be easy
> to use.

That's what I believe it should do, and I guess this is a strike against the
idea of returning None for a single missing value.
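
To make the concern concrete, here is a small plain-Python sketch (the
helper name is hypothetical) of the `x == x` idiom from the quoted Cython
code. It skips NaN, but a None would pass the check, because None == None
is True, and then fail at the addition.

```python
nan = float("nan")

def nan_aware_sum(values):
    """Sum the values, skipping any entry that is not equal to itself."""
    total = 0.0
    for x in values:
        if x == x:  # NaN fails this test; None does not
            total += x
    return total

# NaN is skipped, so only 1.0 and 2.0 are added.
with_nan = nan_aware_sum([1.0, nan, 2.0])

# None passes the self-equality test and then cannot be added to a float.
try:
    nan_aware_sum([1.0, None, 2.0])
    none_was_skipped = True
except TypeError:
    none_was_skipped = False
```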

> A mask, on the other hand, would be more difficult for third
> party packages to support. You have to check if the mask is present
> and if so do a mask-aware calculation; if it is not present then you
> have to do a non-mask based calculation.

I actually see the mask as being easier for third party packages to support,
particularly from C. Having regular C-friendly values with a boolean mask is
a lot friendlier than values that require a lot of special casing like the
NA signal values would require.

> So you have two code paths.
> You also need to check if any of the input arrays have masks and if so
> apply masks to the other inputs, etc.

Most of the time, the masks will transparently propagate or not along with
the arrays, with no effort required. In Python, the code you write would be
virtually the same between the two approaches.
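
The existing numpy.ma module gives a feel for this kind of propagation. The
sketch below uses numpy.ma only as a stand-in; it is not the proposed core
ndarray mask, whose interface is still open for discussion in the NEP.

```python
import numpy as np

# One input carries a mask (True marks a missing element in numpy.ma);
# the other is an ordinary ndarray with no mask at all.
a = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
b = np.array([10.0, 20.0, 30.0])

# The same expression works for masked and unmasked inputs, and the
# mask carries over to the result without any explicit handling.
c = a + b
result_mask = np.ma.getmaskarray(c).tolist()

# Reductions skip the masked element.
total = float(c.sum())
```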

