[Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Mark Wiebe
mwwiebe@gmail....
Fri Jun 24 11:33:35 CDT 2011
On Thu, Jun 23, 2011 at 8:32 PM, Nathaniel Smith <njs@pobox.com> wrote:
> On Thu, Jun 23, 2011 at 5:21 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
> > On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith <njs@pobox.com> wrote:
> >> It should also be possible to accomplish a general solution at the
> >> dtype level. We could have a 'dtype factory' used like:
> >> np.zeros(10, dtype=np.maybe(float))
> >> where np.maybe(x) returns a new dtype whose storage size is x.itemsize
> >> + 1, where the extra byte is used to store missingness information.
> >> (There might be some annoying alignment issues to deal with.) Then for
> >> each ufunc we define a handler for the maybe dtype (or add a
> >> special-case to the ufunc dispatch machinery) that checks the
> >> missingness value and then dispatches to the ordinary ufunc handler
> >> for the wrapped dtype.
> >
> > The 'dtype factory' idea builds on the way I've structured datetime as a
> > parameterized type, but the thing that kills it for me is the alignment
> > problems of 'x.itemsize + 1'. Having the mask in a separate memory block is
> > a lot better than having to store 16 bytes for an 8-byte int to preserve the
> > alignment.
>
> Hmm. I'm not convinced that this is the best approach either, but let
> me play devil's advocate.
>
> The disadvantage of this approach is that masked arrays would
> effectively have a 100% memory overhead over regular arrays, as
> opposed to the "shadow mask" approach where the memory overhead is
> 12.5%--100% depending on the size of objects being stored. Probably
> the most common case is arrays of doubles, in which case it's 100%
> versus 12.5%. So that sucks.
>
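To put rough numbers on that comparison, here is a small sketch. It uses a struct dtype as a stand-in for the proposed maybe(float); np.maybe does not exist, and the field names "value" and "missing" are made up for illustration. It compares an inline value-plus-flag element against a plain float64 array with a separate one-byte-per-element mask.

```python
import numpy as np

# Hypothetical stand-in for maybe(float64): the value plus a one-byte flag.
fields = [("value", np.float64), ("missing", np.uint8)]
aligned = np.dtype(fields, align=True)  # padded so each float64 stays aligned
packed = np.dtype(fields)               # x.itemsize + 1, no padding

n = 1000
plain_bytes = n * np.dtype(np.float64).itemsize

# Inline flag with alignment preserved.
inline_bytes = n * aligned.itemsize

# Separate mask buffer: the plain data plus one byte per element.
mask_bytes = plain_bytes + np.zeros(n, dtype=np.bool_).nbytes


def overhead_percent(total_bytes):
    """Extra memory relative to an unmasked float64 array, in percent."""
    return 100.0 * (total_bytes - plain_bytes) / plain_bytes


print(aligned.itemsize, packed.itemsize)
print(overhead_percent(inline_bytes), overhead_percent(mask_bytes))
```

The packed layout avoids the padding but leaves every float64 after the first unaligned, which is the alignment issue mentioned earlier in the thread.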
> But on the other hand, we gain:
> -- simpler implementation: no need to be checking and tracking the
> mask buffer everywhere. The needed infrastructure is already built in.
>
I don't believe this is true. The dtype mechanism would need a lot of work
to build that needed infrastructure first. The analysis I've done so far
indicates the masked approach will give a simpler/cleaner implementation.
> -- simpler conceptually: we already have the dtype concept, it's
> very powerful and we use it for all sorts of things; using it here too
> plays to our strengths. We already know what a numpy scalar is and how
> it works. Everyone already understands how assigning a value to an
> element of an array works, how it interacts with broadcasting, etc.,
> etc., and in this model, that's all a missing value is -- just another
> value.
>
From Python, this aspect of things would be virtually identical between the
two mechanisms. The dtype approach would require more coding and overhead,
because you have to copy your data to convert it into the parameterized
"NA[int32]" dtype, whereas with the masked approach you say
x.flags.hasmask = True or something like that without copying the data.
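As a rough illustration of the copy versus no-copy difference, the sketch below approximates the two mechanisms with things that exist today: numpy.ma for the separate mask, and a struct dtype for the parameterized "NA[int32]" dtype. Neither is the proposed implementation, the struct layout is an assumption, and x.flags.hasmask is not a real flag.

```python
import numpy as np

data = np.arange(5, dtype=np.int32)

# Mask approach, approximated with numpy.ma: the mask lives in its own
# buffer and the existing data buffer is reused rather than copied.
masked = np.ma.array(data, mask=[False, True, False, False, False])
mask_shares = np.shares_memory(masked.data, data)

# Dtype approach, approximated with a struct dtype (hypothetical layout):
# the values have to be copied into a new, wider buffer.
na_int32 = np.dtype([("value", np.int32), ("missing", np.bool_)])
converted = np.zeros(data.shape, dtype=na_int32)
converted["value"] = data
converted["missing"][1] = True
dtype_shares = np.shares_memory(converted, data)

print(mask_shares, dtype_shares)
```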
> -- it composes better with existing functionality: for example,
> someone mentioned the distinction between a missing field inside a
> record versus a missing record. In this model, that would just be the
> difference between dtype([("x", maybe(float))]) and maybe(dtype([("x",
> float)])).
>
Indeed, the difference between "NA[:x:f4, :y:f4]" and ":x:NA[f4],
:y:NA[f4]" can't be expressed the way I've designed the mask functionality.
(Note, this struct dtype string representation isn't actually supported in
NumPy.)
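For concreteness, the two layouts can be mocked up with existing struct dtypes. This is only a sketch: maybe() below is a hypothetical helper that pairs a value with a boolean flag, not the proposed np.maybe, and the field names are invented.

```python
import numpy as np


def maybe(dt):
    """Hypothetical stand-in: wrap a dtype together with a 'missing' flag."""
    return np.dtype([("value", np.dtype(dt)), ("missing", np.bool_)])


# Missing field inside a record: each field carries its own flag.
per_field = np.dtype([("x", maybe("f4")), ("y", maybe("f4"))])

# Missing record: one flag covers the whole (x, y) pair.
per_record = maybe(np.dtype([("x", "f4"), ("y", "f4")]))

a = np.zeros(3, dtype=per_field)
a["x"]["missing"][0] = True   # only x of element 0 is missing

b = np.zeros(3, dtype=per_record)
b["missing"][0] = True        # all of element 0 is missing

print(per_field.names, per_record.names)
```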
> Optimization is important and all, but so is simplicity and
> robustness. That's why we're using Python in the first place :-).
>
> If we think that the memory overhead for floating point types is too
> high, it would be easy to add a special case where maybe(float) used a
> distinguished NaN instead of a separate boolean. The extra complexity
> would be isolated to the 'maybe' dtype's inner loop functions, and
> transparent to the Python level. (Implementing a similar optimization
> for the masking approach would be really nasty.) This would change the
> overhead comparison to 0% versus 12.5% in favor of the dtype approach.
>
Yeah, there would be no such optimization for the masked approach. If
someone really wants this, they are not precluded from also implementing
their own "nafloat" dtype which operates independently of the masking
mechanism.
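A minimal sketch of the distinguished-NaN idea, done at the Python level rather than in a dtype's inner loops. The bit pattern NA_BITS is an arbitrary quiet-NaN payload chosen for illustration, and is_na is a hypothetical helper. Whether a payload survives arithmetic is not guaranteed, so this only shows storing and detecting the value.

```python
import numpy as np

# Arbitrary quiet-NaN bit pattern reserved to mean "NA" (an assumption).
NA_BITS = np.uint64(0x7FF80000000007A2)


def is_na(arr):
    """True where a float64 array holds exactly the reserved NA pattern."""
    return arr.view(np.uint64) == NA_BITS


data = np.array([1.0, 0.0, np.nan, 4.0])
data.view(np.uint64)[1] = NA_BITS  # write the NA pattern into element 1

print(np.isnan(data))  # NA looks like any other NaN to ordinary code
print(is_na(data))     # only the reserved pattern counts as NA
```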
-Mark
>
> -- Nathaniel