[Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Wed Jun 29 12:53:09 CDT 2011
On Tue, Jun 28, 2011 at 7:34 AM, Lluís <firstname.lastname@example.org> wrote:
> Mark Wiebe writes:
> > The design that's forming is a combination of:
> > * Solve the missing data problem
> > * My ideas of what a good solution looks like:
> > * applies to all NumPy dtypes in a fully general way
> > * high-performance, low overhead where possible
> > * makes the C-level implementation of NumPy nicer to work with, not
> > * easy to use from Python for unskilled programmers
> > * easy to use more powerful functionality from Python for skilled
> > * satisfies all or most of the needs of the many users of arrays with
> > a "missing data" aspect to them
> I would add here an efficient mechanism to reinterpret existing data with
> different missing information (no copies of the backing array).
> Although I'm not sure whether this requires first-class citizenship or
I'm calling this idea "masking semantics" generally.
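As a rough illustration of reinterpreting one buffer with different missing information, here is a minimal sketch using the existing numpy.ma module rather than the mask support proposed in the NEP. The array contents and the names view_a and view_b are made up for the example; True in a mask means "missing", as in numpy.ma.

```python
import numpy as np

# One backing buffer; it is never copied below.
data = np.array([1.0, 2.0, 3.0, 4.0])

# Two hypothetical interpretations of the same data, each with its own mask.
view_a = np.ma.array(data, mask=[False, True, False, False], copy=False)
view_b = np.ma.array(data, mask=[False, False, False, True], copy=False)

# Both masked arrays still point at the original memory.
shares_a = np.shares_memory(view_a.data, data)
shares_b = np.shares_memory(view_b.data, data)

sum_a = view_a.sum()  # ignores element 1 -> 8.0
sum_b = view_b.sum()  # ignores element 3 -> 6.0
```

Only the masks differ between the two views, so the cost of a second interpretation is one boolean array, not a second copy of the data.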
> > * All the feedback I'm getting from discussions on the list
> > I've updated a section "Parameterized Data Type With NA Signal Values"
> > in the NEP with an idea for how an NA bit pattern approach could
> > coexist and work together with the mask-based approach. I think I've
> > solved some of the generality and implementation obstacles; it would
> > be great to get some feedback on that.
> Some (obvious) thoughts about it:
> * Trivial to store, as the missing property is encoded in the value
> * Third-party (non-Python) code needs some interface to interpret these
> without having to know the implementation details (although the
> interface is rather trivial).
> * Data marked as missing loses its original value.
> * Reinterpreting the same data (memory buffer) with different missing
> information requires either memory copies or separate mask arrays (see
> So, while it (data types with NA signal values) has its advantages in
> simpler interaction with 3rd party code and in long-term storage,
> masks will still be needed.
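The point about losing the original value can be sketched as follows. NaN is used here only as a stand-in for an NA signal value (the NEP discusses dedicated NA bit patterns, which do not exist in released NumPy), and the mask side uses the existing numpy.ma module; the array contents are invented for the example.

```python
import numpy as np

original = np.array([10.0, 20.0, 30.0])

# Signal-value approach: NaN stands in for NA (an assumption for this sketch).
signalled = original.copy()
signalled[1] = np.nan              # the value 20.0 is overwritten

# Mask approach: the data buffer is untouched, the mask sits beside it.
masked = np.ma.array(original, mask=[False, True, False])

lost = bool(np.isnan(signalled[1]))   # True: nothing left to recover
recovered = float(masked.data[1])     # 20.0: still in the buffer
```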
> I think that deciding on the value of NA signal values boils down to
> this question: should 3rd party code be able to interpret missing data
> information stored in the separate mask array?
I'm tossing around some variations of ideas using the iterator to provide a
buffered mask-based interface that works uniformly with both masked arrays
and NA dtypes. This way 3rd party C code only needs to implement one missing
data mechanism to fully support both of NumPy's missing data mechanisms.
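To make the idea concrete, here is a minimal Python sketch of what a uniform, buffered (values, mask) interface might look like from the consumer's side. It is not the iterator API itself: iter_with_mask and masked_sum are hypothetical names, numpy.ma stands in for masked arrays, and NaN stands in for an NA bit pattern.

```python
import numpy as np

def iter_with_mask(arr, chunk=4):
    """Yield (values, missing) chunks from either a numpy.ma masked array
    or a plain float array that uses NaN as a stand-in NA signal value."""
    if isinstance(arr, np.ma.MaskedArray):
        values = arr.data
        missing = np.ma.getmaskarray(arr)
    else:
        values = np.asarray(arr, dtype=float)
        missing = np.isnan(values)
    flat_values = values.ravel()
    flat_missing = missing.ravel()
    for start in range(0, flat_values.size, chunk):
        stop = start + chunk
        yield flat_values[start:stop], flat_missing[start:stop]

def masked_sum(arr):
    """Consumer code written once, against the (values, missing) view only."""
    total = 0.0
    for values, missing in iter_with_mask(arr):
        total += values[~missing].sum()
    return float(total)

a = np.ma.array([1.0, 2.0, 3.0, 4.0, 5.0], mask=[0, 1, 0, 0, 0])
b = np.array([1.0, np.nan, 3.0, 4.0, 5.0])

sum_masked = masked_sum(a)   # 13.0
sum_signal = masked_sum(b)   # 13.0
```

The consumer never checks which mechanism produced the missing information, which is the property described above.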
> If the answer is no, then 3rd party code should be given a copy of the
> data where the mask is merged into the ndarray data buffer
> (assuming the original ndarray had a mask before passing it to
> the 3rd party code). Since by definition (?) the ndarray with a mask must
> retain the original data, the result of the 3rd party code must be
> translated back into an ndarray + mask.
> If the answer is yes, then I think the NA signal values just add
> unnecessary complexity, as the 3rd party code will already need to use
> some numpy-specific API to handle missing data through the ndarray
> buffer + mask buffer. This reminds me that if 3rd party code were to use the
> new iterator interface, the interface could be adapted so that it
> returns only the non-missing parts. For the sake of performance, this
> could be optional, so that the default behaviour is to just iterate
> through non-missing data but an option can be used to iterate over all
> data, and leave missing data handling up to the 3rd party code.
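The optional behaviour described above could look roughly like this in Python. This is only a sketch: iter_values and its skip_missing flag are hypothetical, not part of the NumPy iterator, and numpy.ma is used to supply the mask.

```python
import numpy as np

def iter_values(arr, skip_missing=True):
    """Iterate over non-missing values by default; with skip_missing=False,
    yield every (value, is_missing) pair and leave handling to the caller."""
    data = np.ma.getdata(arr).ravel()
    mask = np.ma.getmaskarray(arr).ravel()
    for value, is_missing in zip(data, mask):
        if skip_missing:
            if not is_missing:
                yield float(value)
        else:
            yield float(value), bool(is_missing)

a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

present_only = list(iter_values(a))
everything = list(iter_values(a, skip_missing=False))
```

Here present_only holds just the two unmasked values, while everything keeps all three values together with their missing flags.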
> My 2 cents,
> "And it's much the same thing with knowledge, for whenever you learn
> something new, the whole world becomes that much richer."
> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom