[Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Thu Jun 23 17:05:21 CDT 2011
On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe <firstname.lastname@example.org> wrote:
> Enthought has asked me to look into the "missing data" problem and how NumPy
> could treat it better. I've considered the different ideas of adding dtype
> variants with a special signal value and masked arrays, and concluded that
> adding masks to the core ndarray appears to be the best way to deal with the
> problem in general.
> I've written a NEP that proposes a particular design, viewable here:
> There are some questions at the bottom of the NEP which definitely need
> discussion to find the best design choices. Please read, and let me know of
> all the errors and gaps you find in the document.
Wow, that is exciting.
I wonder about the relative performance of the two possible
implementations (mask and NA) in the NEP.
If you are, say, doing a calculation along the columns of a 2d array
one element at a time, then you will need to grab an element from the
array and grab the corresponding element from the mask. I assume the
corresponding data and mask elements are not stored together. That
would be slow since memory access is usually where time is spent. In
this regard NA would be faster.
I currently use NaN as a missing data marker. That adds things like
this to my cython code:
    if a[i] == a[i]:
        asum += a[i]
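
For anyone who wants to try the idiom outside Cython, here is a minimal pure-Python sketch of the same self-comparison test. The function name `nansum_selfcompare` and the sample array are made up for illustration; the check relies only on the IEEE 754 rule that NaN compares unequal to itself.

```python
import numpy as np

def nansum_selfcompare(a):
    """Sum a 1d float array, skipping NaN via the x == x test."""
    asum = 0.0
    for i in range(a.shape[0]):
        ai = a[i]
        if ai == ai:  # False only when ai is NaN
            asum += ai
    return asum

# Hypothetical sample data with two missing values marked as NaN.
a = np.array([1.0, np.nan, 2.0, np.nan, 3.5])
total = nansum_selfcompare(a)
```

The loop needs no separate missing-data array: the marker travels with the value, so each element costs one read.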
If NA also had the property NA == NA is False, then it would be easy
to use. A mask, on the other hand, would be more difficult for third
party packages to support. You have to check if the mask is present
and if so do a mask-aware calculation; if it is not present then you
have to do a non-mask based calculation. So you have two code paths.
You also need to check if any of the input arrays have masks and if so
apply masks to the other inputs, etc.
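
To make the two code paths concrete, here is a rough sketch that uses the existing `numpy.ma` module as a stand-in for the proposed core-ndarray mask. That substitution is an assumption: the NEP's mask API may look quite different. The helper names `column_sums` and `sum_of_products` are hypothetical.

```python
import numpy as np
import numpy.ma as ma

def column_sums(arr):
    """Sum each column of a 2d array, ignoring masked elements if any."""
    if ma.isMaskedArray(arr) and ma.getmask(arr) is not ma.nomask:
        # Mask-aware path: read the data and the mask side by side.
        data = ma.getdata(arr)
        mask = ma.getmaskarray(arr)  # True marks a missing element
        return np.where(mask, 0.0, data).sum(axis=0)
    # Plain path: no mask present, so sum the data directly.
    return np.asarray(arr).sum(axis=0)

def sum_of_products(x, y):
    """Sum x*y over elements that are valid in both inputs."""
    # An element is missing in the result if it is missing in either
    # input; getmaskarray returns all False for an unmasked array.
    combined = np.logical_or(ma.getmaskarray(x), ma.getmaskarray(y))
    prod = ma.getdata(x) * ma.getdata(y)
    return prod[~combined].sum()

plain = np.array([[1.0, 2.0], [3.0, 4.0]])
masked = ma.array(plain, mask=[[False, True], [False, False]])

plain_cols = column_sums(plain)
masked_cols = column_sums(masked)
mixed = sum_of_products(masked, plain)
```

Even in this small case the function has to branch on whether a mask exists, and a two-input function has to merge masks before it can compute anything.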