[Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Charles R Harris
Wed Jun 29 13:04:18 CDT 2011
On Wed, Jun 29, 2011 at 11:53 AM, Mark Wiebe <firstname.lastname@example.org> wrote:
> On Tue, Jun 28, 2011 at 7:34 AM, Lluís <email@example.com> wrote:
>> Mark Wiebe writes:
>> > The design that's forming is a combination of:
>> > * Solve the missing data problem
>> > * My ideas of what a good solution looks like:
>> > * applies to all NumPy dtypes in a fully general way
>> > * high-performance, low overhead where possible
>> > * makes the C-level implementation of NumPy nicer to work with, not
>> > * easy to use from Python for unskilled programmers
>> > * easy to use more powerful functionality from Python for skilled
>> > * satisfies all or most of the needs of the many users of arrays with
>> > a "missing data" aspect to them
>> I would add here an efficient mechanism to reinterpret existing data with
>> different missing information (no copies of the backing array).
>> Although I'm not sure whether this requires first-class citizenship or
> I'm calling this idea "masking semantics" generally.
>> > * All the feedback I'm getting from discussions on the list
>> > I've updated a section "Parameterized Data Type With NA Signal Values"
>> > in the NEP with an idea for how an NA bit pattern approach could
>> > coexist and work together with the mask-based approach. I think I've
>> > solved some of the generality and implementation obstacles, it would
>> > be great to get some feedback on that.
>> Some (obvious) thoughts about it:
>> * Trivial to store, as the missing property is encoded in the value
>> * Third-party (non-Python) code needs some interface to interpret these
>> without having to know the implementation details (although the
>> interface is rather trivial).
>> * Data marked as missing loses its original value.
>> * Reinterpreting the same data (memory buffer) with different missing
>> information requires either memory copies or separate mask arrays (see
>> So, while it (data types with NA signal values) has its advantages on a
>> simpler interaction with 3rd party code and during long-term storage,
>> masks will still be needed.
>> I think that deciding on the value of NA signal values boils down to
>> this question: should 3rd party code be able to interpret missing data
>> information stored in the separate mask array?
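To make the trade-off in the points above concrete, here is a minimal sketch using the existing numpy.ma module and NaN as a stand-in NA signal value. Both choices are assumptions for illustration only; the NEP's NA dtypes and mask API are not what is shown here.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Signal-value approach (assumption: NaN plays the role of NA).
# Marking an element overwrites its value, so a second interpretation
# of the same data needs its own copy of the buffer.
na_encoded = data.copy()
na_encoded[1] = np.nan

# Mask approach: two different masks over one untouched buffer.
# In numpy.ma a True mask entry means "missing".
view_a = np.ma.array(data, mask=[False, True, False, False], copy=False)
view_b = np.ma.array(data, mask=[False, False, False, True], copy=False)

print(view_a.sum(), view_b.sum())
print(np.shares_memory(view_a.data, data))
```

Both masked views read from the same memory as `data`, and the original value at index 1 survives, whereas the NaN-encoded copy has lost it.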
> I'm tossing around some variations of ideas using the iterator to provide a
> buffered mask-based interface that works uniformly with both masked arrays
> and NA dtypes. This way 3rd party C code only needs to implement one missing
> data mechanism to fully support both of NumPy's missing data mechanisms.
;) Also, it avoids a horrible mass of code.
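
A rough Python-level sketch of the "one mechanism for both" idea, not the C iterator interface under discussion: the helper `as_data_and_mask` is hypothetical, NaN is assumed to be the NA bit pattern for floats, and buffering is ignored entirely.

```python
import numpy as np

def as_data_and_mask(arr):
    """Hypothetical helper: return (values, valid) for either representation."""
    if isinstance(arr, np.ma.MaskedArray):
        # numpy.ma marks missing entries with True, so invert to get validity.
        return arr.data, ~np.ma.getmaskarray(arr)
    # Assumption: NaN is the NA bit pattern for float arrays.
    return arr, ~np.isnan(arr)

def masked_sum(arr):
    """Consumer code written once against the (values, valid) pair."""
    values, valid = as_data_and_mask(arr)
    return values[valid].sum()

masked = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
na_encoded = np.array([1.0, np.nan, 3.0])
print(masked_sum(masked), masked_sum(na_encoded))
```

The consumer function never needs to know which representation it was given, which is the property that would let 3rd party code support both with a single code path.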