[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Lluís xscript@gmx....
Tue Jun 28 07:34:07 CDT 2011


Mark Wiebe writes:
> The design that's forming is a combination of:

> * Solve the missing data problem 
> * My ideas of what a good solution looks like:
>    * applies to all NumPy dtypes in a fully general way
>    * high-performance, low overhead where possible
>    * makes the C-level implementation of NumPy nicer to work with, not harder
>    * easy to use from Python for unskilled programmers
>    * easy to use more powerful functionality from Python for skilled programmers
>    * satisfies all or most of the needs of the many users of arrays with a "missing data" aspect to them

I would add here an efficient mechanism to reinterpret existing data with
different missing information (no copies of the backing array).

Although I'm not sure whether this requires first-class citizenship or
not.
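
A minimal sketch of the "no copies" idea, using the existing numpy.ma
module as a stand-in (an assumption; the mask design in the NEP may end
up looking different). The array contents and variable names are made up
for illustration:

```python
import numpy as np

# One backing buffer holding the actual measurements.
data = np.array([1.0, 2.0, 3.0, 4.0])

# Two different "missing" interpretations of the same buffer.
# MaskedArray does not copy its input by default, so both share `data`.
view_a = np.ma.masked_array(data, mask=[False, True, False, False])
view_b = np.ma.masked_array(data, mask=[False, False, False, True])

shares_a = np.shares_memory(view_a.data, data)
shares_b = np.shares_memory(view_b.data, data)

# Each view skips its own missing element when reducing.
sum_a = view_a.sum()  # 1 + 3 + 4
sum_b = view_b.sum()  # 1 + 2 + 3
```

Only the (small) mask arrays differ between the two views; the data
buffer is allocated once.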


> * All the feedback I'm getting from discussions on the list
[...]
> I've updated a section "Parameterized Data Type With NA Signal Values"
> in the NEP with an idea for how an NA bit pattern approach could
> coexist and work together with the mask-based approach. I think I've
> solved some of the generality and implementation obstacles, it would
> be great to get some feedback on that.

Some (obvious) thoughts about it:

* Trivial to store, as the missing property is encoded in the value
  itself.
* Third-party (non-Python) code needs some interface to interpret these
  without having to know the implementation details (although the
  interface is rather trivial).
* Data marked as missing loses its original value.
* Reinterpreting the same data (memory buffer) with different missing
  information requires either memory copies or separate mask arrays (see
  above)
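
The last two points can be sketched with NaN playing the role of the NA
signal value. This is only an illustration under that assumption: the
NEP's parameterized dtypes do not exist yet, and NaN only works for
floating-point data.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Interpretation A: element 1 is missing. The sentinel overwrites the
# value, so a copy is needed if `data` must stay intact.
interp_a = data.copy()
interp_a[1] = np.nan

# Interpretation B: element 3 is missing. This needs a second copy.
interp_b = data.copy()
interp_b[3] = np.nan

# The original 2.0 cannot be recovered from interp_a alone.
original_lost = bool(np.isnan(interp_a[1]))

# The two interpretations do not share memory.
independent = not np.shares_memory(interp_a, interp_b)

# Reductions must be NaN-aware to skip the missing element.
sum_a = np.nansum(interp_a)  # 1 + 3 + 4
```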

So, while data types with NA signal values have their advantages
(simpler interaction with 3rd party code, and long-term storage), masks
will still be needed.

I think that deciding on the value of NA signal values boils down to
this question: should 3rd party code be able to interpret missing data
information stored in the separate mask array?

If the answer is no, then 3rd party code should be given a copy of the
data where the mask array is merged into the ndarray data buffer
(assuming the original ndarray had a mask before being passed to the
3rd party code). As by definition (?) the ndarray with a mask must
retain the original data, the result of the 3rd party code must be
translated back into an ndarray + mask.
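
A rough sketch of that round trip, again using numpy.ma and NaN as the
signal value (both are assumptions for illustration). The functions
third_party_scale and call_with_merged_mask are hypothetical names, not
part of any existing API:

```python
import numpy as np

def third_party_scale(buf):
    """Hypothetical external code that only understands NaN as 'missing'."""
    return buf * 10.0

def call_with_merged_mask(marr, func):
    """Merge the mask into a copy of the data, call func, rebuild a mask."""
    merged = marr.filled(np.nan)   # copy; masked slots become NaN
    result = func(merged)          # external code never sees the mask
    # Translate the signal values in the result back into a mask.
    return np.ma.masked_array(result, mask=np.isnan(result))

marr = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
out = call_with_merged_mask(marr, third_party_scale)
```

The input keeps its hidden value (marr.data[1] is still 2.0), but the
masked slot of the result only holds NaN, since the 3rd party code never
saw the original value.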

If the answer is yes, then I think the NA signal values just add
unnecessary complexity, as the 3rd party code will already need to use
some numpy-specific API to handle missing data through the ndarray
buffer + mask buffer. This reminds me that if 3rd party code were to use
the new iterator interface, the interface could be adapted so that it
returns only the non-missing parts. For the sake of performance, this
could be optional, so that the default behaviour is to just iterate
through non-missing data but an option can be used to iterate over all
data, and leave missing data handling up to the 3rd party code.
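
In pure Python, such an iterator option could behave roughly like the
sketch below. The iter_elements helper and its skip_missing flag are
hypothetical; they are not part of the real iterator API, which lives at
the C level:

```python
import numpy as np

def iter_elements(data, mask, skip_missing=True):
    """Yield (index, value, is_missing) for each element.

    With skip_missing=True (the default) masked elements are never
    yielded; with skip_missing=False the caller sees everything and
    handles missing data itself.
    """
    for idx in np.ndindex(data.shape):
        missing = bool(mask[idx])
        if skip_missing and missing:
            continue
        yield idx, data[idx], missing

data = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[False, True], [False, False]])

# Default: only the non-missing values are visited.
present = [float(v) for _, v, _ in iter_elements(data, mask)]

# Optional: visit all elements, with the missing flag attached.
everything = list(iter_elements(data, mask, skip_missing=False))
```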


My 2 cents,
  Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth

