[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Nathaniel Smith njs@pobox....
Thu Jun 23 20:32:00 CDT 2011


On Thu, Jun 23, 2011 at 5:21 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
> On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith <njs@pobox.com> wrote:
>> It should also be possible to accomplish a general solution at the
>> dtype level. We could have a 'dtype factory' used like:
>>  np.zeros(10, dtype=np.maybe(float))
>> where np.maybe(x) returns a new dtype whose storage size is x.itemsize
>> + 1, where the extra byte is used to store missingness information.
>> (There might be some annoying alignment issues to deal with.) Then for
>> each ufunc we define a handler for the maybe dtype (or add a
>> special-case to the ufunc dispatch machinery) that checks the
>> missingness value and then dispatches to the ordinary ufunc handler
>> for the wrapped dtype.
>
> The 'dtype factory' idea builds on the way I've structured datetime as a
> parameterized type, but the thing that kills it for me is the alignment
> problems of 'x.itemsize + 1'. Having the mask in a separate memory block is
> a lot better than having to store 16 bytes for an 8-byte int to preserve the
> alignment.

Hmm. I'm not convinced that this is the best approach either, but let
me play devil's advocate.
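
To make the storage side concrete, here's a rough mock-up of the
proposed layout using today's structured dtypes (np.maybe itself is
hypothetical, and this ignores the ufunc-dispatch half of the idea
entirely):

  import numpy as np

  def maybe(base):
      # the wrapped value plus one extra byte of missingness information
      base = np.dtype(base)
      return np.dtype([('value', base), ('missing', np.uint8)])

  a = np.zeros(10, dtype=maybe(float))
  a['missing'][3] = 1                    # mark element 3 as missing
  print(a.dtype.itemsize)                # 9 -- packed, hence the alignment complaint
  print(np.dtype([('value', float), ('missing', np.uint8)],
                 align=True).itemsize)   # 16 once you pad for alignment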

The disadvantage of this approach is that masked arrays would
effectively have a 100% memory overhead over regular arrays, as
opposed to the "shadow mask" approach, where one mask byte per element
works out to a 12.5%--100% overhead depending on the itemsize of the
objects being stored. Probably the most common case is arrays of
doubles, in which case it's 100% versus 12.5%. So that sucks.

But on the other hand, we gain:
  -- simpler implementation: no need to check and track the mask
buffer everywhere; the needed infrastructure is already built in.
  -- simpler conceptually: we already have the dtype concept; it's
very powerful, we use it for all sorts of things, and using it here too
plays to our strengths. We already know what a numpy scalar is and how
it works. Everyone already understands how assigning a value to an
element of an array works, how it interacts with broadcasting, etc.,
etc., and in this model, that's all a missing value is -- just another
value.
  -- it composes better with existing functionality: for example,
someone mentioned the distinction between a missing field inside a
record versus a missing record. In this model, that would just be the
difference between dtype([("x", maybe(float))]) and maybe(dtype([("x",
float)])).
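
Mocked up with the same hypothetical maybe() from the sketch above,
those two layouts would be:

  # a missing field inside a record vs. a missing record
  field_missing  = np.dtype([('x', maybe(float))])   # x itself may be missing
  record_missing = maybe(np.dtype([('x', float)]))   # the whole record may be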

Optimization is important and all, but so are simplicity and
robustness. That's why we're using Python in the first place :-).

If we think that the memory overhead for floating point types is too
high, it would be easy to add a special case where maybe(float) used a
distinguished NaN instead of a separate boolean. The extra complexity
would be isolated to the 'maybe' dtype's inner loop functions, and
transparent to the Python level. (Implementing a similar optimization
for the masking approach would be really nasty.) This would change the
overhead comparison to 0% versus 12.5% in favor of the dtype approach.
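
To sketch what the distinguished-NaN trick would look like -- reserve
one specific NaN bit pattern as the missing sentinel and compare bit
patterns, since NaN != NaN under float comparison. The particular
payload below is an arbitrary choice, not part of any proposal:

  import numpy as np

  # one quiet-NaN bit pattern reserved to mean "missing"
  MISSING_BITS = np.array([0x7FF800000000DEAD], dtype=np.uint64)
  MISSING = MISSING_BITS.view(np.float64)[0]

  def is_missing(arr):
      # bitwise comparison: NaN != NaN, so a float compare won't do
      return arr.view(np.uint64) == MISSING_BITS[0]

  a = np.array([1.0, MISSING, 3.0])
  print(np.isnan(MISSING))   # True -- still an ordinary NaN to the hardware
  print(is_missing(a))       # [False  True False]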

-- Nathaniel

