[Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Thu Jun 23 20:32:00 CDT 2011
On Thu, Jun 23, 2011 at 5:21 PM, Mark Wiebe <firstname.lastname@example.org> wrote:
> On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith <email@example.com> wrote:
>> It should also be possible to achieve a general solution at the
>> dtype level. We could have a 'dtype factory' used like:
>> np.zeros(10, dtype=np.maybe(float))
>> where np.maybe(x) returns a new dtype whose storage size is x.itemsize
>> + 1, where the extra byte is used to store missingness information.
>> (There might be some annoying alignment issues to deal with.) Then for
>> each ufunc we define a handler for the maybe dtype (or add a
>> special-case to the ufunc dispatch machinery) that checks the
>> missingness value and then dispatches to the ordinary ufunc handler
>> for the wrapped dtype.
> The 'dtype factory' idea builds on the way I've structured datetime as a
> parameterized type, but the thing that kills it for me is the alignment
> problems of 'x.itemsize + 1'. Having the mask in a separate memory block is
> a lot better than having to store 16 bytes for an 8-byte int to preserve the
> alignment.
Hmm. I'm not convinced that this is the best approach either, but let
me play devil's advocate.
The disadvantage of this approach is that masked arrays would
effectively have a 100% memory overhead over regular arrays, as
opposed to the "shadow mask" approach where the memory overhead is
12.5%--100% depending on the size of objects being stored. Probably
the most common case is arrays of doubles, in which case it's 100%
versus 12.5%. So that sucks.
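To make the numbers concrete, here's a rough sketch of the proposed
layout. np.maybe doesn't exist; this just emulates the storage with an
aligned structured dtype, which reproduces the 16-bytes-per-double /
100%-overhead figure:

    import numpy as np

    # Hypothetical emulation of the proposed maybe() factory: store the
    # value plus a 1-byte missingness flag, padded for alignment.
    def maybe(base):
        base = np.dtype(base)
        return np.dtype({'names': ['value', 'missing'],
                         'formats': [base, np.uint8]}, align=True)

    m = maybe(np.float64)
    print(m.itemsize)         # 16: 8 data bytes + 1 flag byte, padded to
                              # 8-byte alignment -- the 100% overhead above
    a = np.zeros(10, dtype=m)
    a['missing'][3] = 1       # mark element 3 as missing

The shadow-mask layout would instead keep those flag bytes in a
separate 10-byte buffer alongside the 80-byte data buffer, which is
where the 12.5% figure for doubles comes from.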
But on the other hand, we gain:
-- simpler implementation: no need to check and track the mask buffer
everywhere. The needed infrastructure is already built in.
-- simpler conceptually: we already have the dtype concept; it's very
powerful and we use it for all sorts of things, so using it here too
plays to our strengths. We already know what a numpy scalar is and how
it works. Everyone already understands how assigning a value to an
element of an array works, how it interacts with broadcasting, etc.,
etc., and in this model, that's all a missing value is -- just another
value.
-- it composes better with existing functionality: for example,
someone mentioned the distinction between a missing field inside a
record versus a missing record. In this model, that would just be the
difference between dtype([("x", maybe(float))]) and
maybe(dtype([("x", float)])) (see the sketch after this list).
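Continuing the hedged emulation above (again, maybe() is illustrative,
not a real NumPy API), the two cases would look like:

    # A record whose "x" field may individually be missing:
    field_maybe = np.dtype([("x", maybe(np.float64))])

    # A record that may be missing as a whole unit:
    record_maybe = maybe(np.dtype([("x", np.float64)]))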
Optimization is important and all, but so are simplicity and
robustness. That's why we're using Python in the first place :-).
If we think that the memory overhead for floating point types is too
high, it would be easy to add a special case where maybe(float) used a
distinguished NaN instead of a separate boolean. The extra complexity
would be isolated to the 'maybe' dtype's inner loop functions, and
transparent to the Python level. (Implementing a similar optimization
for the masking approach would be really nasty.) This would change the
overhead comparison to 0% versus 12.5% in favor of the dtype approach.
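For illustration, here's a minimal sketch of that NaN trick; the
sentinel bit pattern and the is_missing helper are assumptions for the
sketch, not anything NumPy provides. The idea is to reserve one quiet-NaN
payload as the "missing" value and test for it by comparing raw bits:

    import numpy as np

    # Assumed sentinel: a quiet NaN with payload 1, distinct from the
    # default quiet NaN (0x7ff8000000000000) that arithmetic produces.
    MISSING_BITS = np.uint64(0x7ff8000000000001)

    def is_missing(a):
        # NaN != NaN, so compare the raw float64 bit patterns instead.
        return a.view(np.uint64) == MISSING_BITS

    a = np.zeros(4)
    a.view(np.uint64)[2] = MISSING_BITS   # mark element 2 as missing
    print(is_missing(a))                  # [False False  True False]

Since the flag lives inside the float itself, no extra byte per element
is needed, which is where the 0% figure comes from.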