[Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Fri Jun 24 11:55:42 CDT 2011
On Fri, Jun 24, 2011 at 8:57 AM, Keith Goodman <firstname.lastname@example.org> wrote:
> On Thu, Jun 23, 2011 at 3:24 PM, Mark Wiebe <email@example.com> wrote:
> > On Thu, Jun 23, 2011 at 5:05 PM, Keith Goodman <firstname.lastname@example.org> wrote:
> >> On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe <email@example.com> wrote:
> >> > Enthought has asked me to look into the "missing data" problem and how
> >> > NumPy could treat it better. I've considered the different ideas of adding
> >> > dtype variants with a special signal value and masked arrays, and concluded
> >> > that adding masks to the core ndarray appears to be the best way to deal
> >> > with the problem in general.
> >> > I've written a NEP that proposes a particular design, viewable here:
> >> >
> >> >
> >> > There are some questions at the bottom of the NEP which definitely need
> >> > discussion to find the best design choices. Please read, and let me know
> >> > of all the errors and gaps you find in the document.
> >> Wow, that is exciting.
> >> I wonder about the relative performance of the two possible
> >> implementations (mask and NA) in the NEP.
> > I've given that some thought, and I don't think there's a clear way to
> > tell what the performance gap would be without implementations of both to
> > benchmark against each other. I favor the mask primarily because it provides
> > masking for all data types in one go with a single consistent interface to
> > program against. For adding NA signal values, each new data type would need
> > a lot of work to gain the same level of support.
> >> If you are, say, doing a calculation along the columns of a 2d array
> >> one element at a time, then you will need to grab an element from the
> >> array and grab the corresponding element from the mask. I assume the
> >> corresponding data and mask elements are not stored together. That
> >> would be slow since memory access is usually where time is spent. In
> >> this regard NA would be faster.
> > Yes, the masks add more memory traffic and some extra calculation, while
> > NA signal values just require some additional calculations.
> I guess a better example would have been summing along rows instead of
> columns of a large C order array. If one needs to look at both the
> data and the mask then wouldn't summing along rows in cython be about
> as slow as it is currently to sum along columns?
Not quite, both the mask and the array data are being traversed coherently,
so it isn't jumping around in memory like in the columns case you're describing.
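
To make the two access patterns concrete, here is a minimal pure-Python sketch. It is not the NEP's implementation: the function names and the convention that `valid[i]` is True for present values are assumptions made for illustration. The masked version walks the data and the mask in lockstep, so both buffers are read sequentially; the NaN version reads one buffer and pays for an extra comparison per element.

```python
import math


def sum_with_mask(data, valid):
    """Sum data[i] wherever valid[i] is True (hypothetical mask convention).

    data and valid are traversed together, element by element.
    """
    total = 0.0
    for value, ok in zip(data, valid):
        if ok:
            total += value
    return total


def sum_with_nan(data):
    """Sum the values, skipping NaN sentinels.

    Relies on NaN being the only float for which x == x is False.
    """
    total = 0.0
    for value in data:
        if value == value:
            total += value
    return total


data = [1.0, 2.0, 3.0, 4.0]
valid = [True, False, True, True]      # second element is missing
sentinel = [1.0, math.nan, 3.0, 4.0]   # same data, NaN marks the gap

masked_total = sum_with_mask(data, valid)
nan_total = sum_with_nan(sentinel)
```

Both calls skip the second element and agree on the total. Which one is faster in compiled code is exactly the open question above, and this sketch does not settle it.
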
> >> I currently use NaN as a missing data marker. That adds things like
> >> this to my cython code:
> >>     if a[i] == a[i]:
> >>         asum += a[i]
> >> If NA also had the property NA == NA is False, then it would be easy
> >> to use.
> > That's what I believe it should do, and I guess this is a strike against the
> > idea of returning None for a single missing value.
> If NA == NA is False then I wouldn't need to look at the mask in the
> example above. Or would ndarray have to look at the mask in order to
> return NA for a[i]? Which would mean __getitem__ would need to look at
> the mask?
What R does is return NA for NA == NA. Then, if you try to use it as a
boolean, it throws an exception. I like this approach.
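
As a rough illustration of that behaviour, here is a small Python sketch. The `NAType` class and the `NA` singleton are hypothetical names, not part of NumPy or of the NEP: comparisons involving NA return NA, and using NA as a boolean raises.

```python
class NAType:
    """Hypothetical missing-value singleton with R-like comparison rules."""

    def __eq__(self, other):
        # Comparing anything with NA yields NA, not True or False.
        return self

    def __ne__(self, other):
        return self

    def __bool__(self):
        # Refuse to silently pick a branch when the answer is unknown.
        raise TypeError("truth value of NA is undefined")

    def __repr__(self):
        return "NA"


NA = NAType()

comparison = (NA == NA)   # evaluates to NA itself

try:
    if comparison:
        pass
except TypeError as exc:
    message = str(exc)
```

With this rule, a test such as `if a[i] == a[i]:` would raise for a missing element instead of skipping it, so code relying on the NaN idiom would need an explicit missing-value check.
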
> If the missing value is returned as a 0d array (so that NA == NA is
> False), would that break cython in a fundamental way since it could
> not always return a same-sized scalar when you index into an array?
I don't know enough about Cython internals to comment, sorry.
> >> A mask, on the other hand, would be more difficult for third
> >> party packages to support. You have to check if the mask is present
> >> and if so do a mask-aware calculation; if it is not present then you
> >> have to do a non-mask based calculation.
> > I actually see the mask as being easier for third party packages to support,
> > particularly from C. Having regular C-friendly values with a boolean mask is
> > a lot friendlier than values that require a lot of special casing like the
> > NA signal values would require.
> >> So you have two code paths.
> >> You also need to check if any of the input arrays have masks and if so
> >> apply masks to the other inputs, etc.
> > Most of the time, the masks will transparently propagate or not along with
> > the arrays, with no effort required. In Python, the code you write would be
> > virtually the same between the two approaches.
> > -Mark
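
One way to picture the "two code paths" concern and the propagation answer is the pure-Python sketch below. It is only an illustration under stated assumptions: `masked_add` and `_mask_or_default` are hypothetical helpers, and True is taken to mean "masked out" (the convention `numpy.ma` uses), which need not match the NEP. Treating a missing mask as an all-False mask collapses the two paths into one, and the output mask is the elementwise OR of the input masks.

```python
def _mask_or_default(mask, n):
    """Return the given mask as a list, or an all-False mask of length n."""
    return [False] * n if mask is None else list(mask)


def masked_add(a, b, a_mask=None, b_mask=None):
    """Elementwise a + b with mask propagation (hypothetical helper).

    An output element is masked if either corresponding input is masked.
    Inputs without a mask are treated as fully unmasked.
    """
    n = len(a)
    a_mask = _mask_or_default(a_mask, n)
    b_mask = _mask_or_default(b_mask, n)
    values = [x + y for x, y in zip(a, b)]
    mask = [ma or mb for ma, mb in zip(a_mask, b_mask)]
    return values, mask


# Only the first operand carries a mask; the second has none.
values, mask = masked_add([1, 2, 3], [10, 20, 30],
                          a_mask=[False, True, False])
```

Here the masked position in the first operand carries over to the result even though the second operand has no mask at all.
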
> >> _______________________________________________
> >> NumPy-Discussion mailing list
> >> NumPy-Discussion@scipy.org
> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion