[Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Fri Jun 24 14:36:02 CDT 2011
On Fri, Jun 24, 2011 at 12:06 PM, Wes McKinney <firstname.lastname@example.org> wrote:
> On Fri, Jun 24, 2011 at 12:33 PM, Mark Wiebe <email@example.com> wrote:
> > On Thu, Jun 23, 2011 at 8:32 PM, Nathaniel Smith <firstname.lastname@example.org> wrote:
> >> On Thu, Jun 23, 2011 at 5:21 PM, Mark Wiebe <email@example.com> wrote:
> >> > On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith <firstname.lastname@example.org> wrote:
> >> >> It should also be possible to accomplish a general solution at the
> >> >> dtype level. We could have a 'dtype factory' used like:
> >> >> np.zeros(10, dtype=np.maybe(float))
> >> >> where np.maybe(x) returns a new dtype whose storage size is
> >> >> x.itemsize + 1, where the extra byte is used to store missingness information.
> >> >> (There might be some annoying alignment issues to deal with.) Then for
> >> >> each ufunc we define a handler for the maybe dtype (or add a
> >> >> special-case to the ufunc dispatch machinery) that checks the
> >> >> missingness value and then dispatches to the ordinary ufunc handler
> >> >> for the wrapped dtype.
> >> >
> >> > The 'dtype factory' idea builds on the way I've structured datetime as a
> >> > parameterized type, but the thing that kills it for me is the alignment
> >> > problems of 'x.itemsize + 1'. Having the mask in a separate memory block
> >> > is a lot better than having to store 16 bytes for an 8-byte int to
> >> > preserve the alignment.
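The alignment trade-off discussed above can be illustrated with today's structured dtypes. This is only a sketch: np.maybe does not exist, so the value-plus-flag-byte layout below is a hypothetical stand-in for it, and the padded size depends on the platform's alignment of float64.

```python
import numpy as np

# Hypothetical layout for maybe(float64): the value plus one flag byte.
fields = [("value", np.float64), ("missing", np.uint8)]
packed = np.dtype(fields)               # no padding, so elements are unaligned
aligned = np.dtype(fields, align=True)  # padded so every value stays aligned

print(packed.itemsize)    # 9, i.e. x.itemsize + 1
print(aligned.itemsize)   # typically 16 on 64-bit platforms

# Separate mask buffer: each array keeps its natural alignment.
values = np.zeros(10, dtype=np.float64)
mask = np.zeros(10, dtype=np.bool_)
bytes_per_element = (values.nbytes + mask.nbytes) // values.size
print(bytes_per_element)  # 9
```

The packed layout matches the cost of a separate mask but gives up alignment; the aligned layout keeps alignment at the price of padding.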
> >> Hmm. I'm not convinced that this is the best approach either, but let
> >> me play devil's advocate.
> >> The disadvantage of this approach is that masked arrays would
> >> effectively have a 100% memory overhead over regular arrays, as
> >> opposed to the "shadow mask" approach where the memory overhead is
> >> 12.5%--100% depending on the size of objects being stored. Probably
> >> the most common case is arrays of doubles, in which case it's 100%
> >> versus 12.5%. So that sucks.
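The 12.5%--100% range quoted above follows from a one-byte-per-element mask. A small sketch of the arithmetic, where mask_overhead is a hypothetical helper name and not a NumPy function:

```python
import numpy as np

def mask_overhead(dtype):
    """Fractional memory overhead of a one-byte-per-element mask."""
    return np.dtype(np.bool_).itemsize / np.dtype(dtype).itemsize

for dt in (np.int8, np.int32, np.float64):
    # Prints 100.0% for int8, 25.0% for int32, 12.5% for float64.
    print(np.dtype(dt).name, f"{mask_overhead(dt):.1%}")
```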
> >> But on the other hand, we gain:
> >> -- simpler implementation: no need to be checking and tracking the
> >> mask buffer everywhere. The needed infrastructure is already built in.
> > I don't believe this is true. The dtype mechanism would need a lot of
> > work to build that needed infrastructure first. The analysis I've done so far
> > indicates the masked approach will give a simpler/cleaner implementation.
> >> -- simpler conceptually: we already have the dtype concept, it's
> >> very powerful and we use it for all sorts of things; using it here too
> >> plays to our strengths. We already know what a numpy scalar is and how
> >> it works. Everyone already understands how assigning a value to an
> >> element of an array works, how it interacts with broadcasting, etc.,
> >> etc., and in this model, that's all a missing value is -- just another
> >> value.
> > From Python, this aspect of things would be virtually identical between the
> > two mechanisms. The dtype approach would require more coding and overhead
> > where you have to create copies of your data to convert it into the
> > parameterized "NA[int32]" dtype, versus with the masked approach where you
> > say x.flags.hasmask = True or something like that without copying the data.
> >> -- it composes better with existing functionality: for example,
> >> someone mentioned the distinction between a missing field inside a
> >> record versus a missing record. In this model, that would just be the
> >> difference between dtype([("x", maybe(float))]) and maybe(dtype([("x",
> >> float)])).
> > Indeed, the difference between an "NA[:x:f4, :y:f4]" versus ":x:NA[f4],
> > :y:NA[f4]" can't be expressed the way I've designed the mask
> > (Note, this struct dtype string representation isn't actually supported in
> > NumPy.)
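The missing-field versus missing-record distinction can be mimicked with nested structured dtypes. This is a rough sketch only: maybe() below is a hypothetical emulation (a value plus a boolean flag), not the proposed dtype factory, and nothing here teaches ufuncs about the flag.

```python
import numpy as np

def maybe(dtype):
    """Hypothetical emulation of maybe(T): T plus a one-byte 'missing' flag."""
    return np.dtype([("value", dtype), ("missing", np.bool_)])

# A record whose field x may be missing: dtype([("x", maybe(float))])
missing_field = np.dtype([("x", maybe(np.float64))])

# A record that may be missing as a whole: maybe(dtype([("x", float)]))
missing_record = maybe(np.dtype([("x", np.float64)]))

a = np.zeros(2, dtype=missing_field)
a["x"]["missing"][0] = True   # only field x of record 0 is missing

b = np.zeros(2, dtype=missing_record)
b["missing"][1] = True        # record 1 is missing as a whole
```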
> >> Optimization is important and all, but so is simplicity and
> >> robustness. That's why we're using Python in the first place :-).
> >> If we think that the memory overhead for floating point types is too
> >> high, it would be easy to add a special case where maybe(float) used a
> >> distinguished NaN instead of a separate boolean. The extra complexity
> >> would be isolated to the 'maybe' dtype's inner loop functions, and
> >> transparent to the Python level. (Implementing a similar optimization
> >> for the masking approach would be really nasty.) This would change the
> >> overhead comparison to 0% versus 12.5% in favor of the dtype approach.
> > Yeah, there would be no such optimization for the masked approach. If
> > someone really wants this, they are not precluded from also implementing
> > their own "nafloat" dtype which operates independently of the masking
> > mechanism.
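The "distinguished NaN" idea can be sketched at the bit level. The payload value and the names NA_BITS and is_na are arbitrary choices for illustration, not part of any proposal; the sketch covers storage and detection only and makes no claim that the payload survives arithmetic.

```python
import numpy as np

# Hypothetical NA bit pattern: a quiet NaN carrying an arbitrary payload.
NA_BITS = np.uint64(0x7FF80000000007A2)

def is_na(a):
    """True where an element holds exactly the NA bit pattern."""
    bits = np.ascontiguousarray(a, dtype=np.float64).view(np.uint64)
    return bits == NA_BITS

x = np.array([1.0, np.nan, 3.0])
x.view(np.uint64)[2] = NA_BITS   # store NA in place, no extra memory

print(np.isnan(x))   # [False  True  True]: NA is still a NaN
print(is_na(x))      # [False False  True]: an ordinary NaN is not NA
```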
> > -Mark
> >> -- Nathaniel
> >> _______________________________________________
> >> NumPy-Discussion mailing list
> >> NumPy-Discussion@scipy.org
> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> I don't have enough time to engage in this discussion as I'd like but
> I'll give my input.
> I've spent a very large amount of time in pandas trying to craft a sensible
> and performant missing-data-handling solution given existing tools
> which does not get in the user's way and which also works for
> non-floating point data. The result works but isn't completely 100%
> satisfactory, and I went by the Zen of Python in that "practicality
> beats purity". If anyone's interested, have a pore through the pandas
> unit tests for lots of exceptional cases and examples.
> About 38 months ago when I started writing the library now called
> pandas I examined numpy.ma and friends and decided that
> a) the performance overhead for floating point data was not acceptable
> b) numpy.ma does too much for the needs of financial applications,
> say, or in mimicking R's NA functionality (part of why perf suffers)
> c) masked arrays are difficult (imho) for non-expert users to use
> effectively. In my experience, it gets in your way, and subclassing is
> largely to blame for this (along with the mask field and the myriad
> mask-related functions). It's very "complete" from a technical purity
> / computer science-y standpoint but practicality is traded off (re: Zen
> of Python). In R many functions have a flag to handle NA's like na.rm
> and there is the is.na function, along with a few other NA-handling
> functions, and that's it. I wasn't willing to teach my colleagues /
> users of pandas how to use masked arrays (or scikits.timeseries,
> because of numpy.ma reliance) for this reason. I believe that this has
> overall been the right decision.
Having a more R-like interface to the underlying masked implementation is
what I'm leaning towards, with the mask being an implementation detail which
is still accessible, but not at centre stage.
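As a rough illustration of an R-like surface over a masked implementation, here is a sketch built on the existing numpy.ma. The names is_na and na_sum are hypothetical and only mirror R's is.na and the na.rm flag; this is not the interface under design.

```python
import numpy as np

def is_na(a):
    """Boolean array marking missing elements, like R's is.na()."""
    return np.ma.getmaskarray(a)

def na_sum(a, na_rm=False):
    """Sum in the style of R's sum(x, na.rm=...).

    A missing element makes the result missing unless na_rm is True.
    """
    if not na_rm and is_na(a).any():
        return np.ma.masked
    return np.ma.sum(a)

x = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(is_na(x))               # [False  True False]
print(na_sum(x))              # --
print(na_sum(x, na_rm=True))  # 4.0
```

The mask stays reachable through x.mask, but callers of these two functions never have to touch it.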
> So whatever solution you come up with, you need to dogfood it if
> possible with users who are only at a beginning-to-intermediate level
> of NumPy or Python expertise. Does it get in the way? Does it require
> constant tinkering with masks (if there is a boolean mask versus a
> special NA value)? Intuitive? Hopefully I can take whatever result
> comes of this development effort and change pandas to be implemented
> on top of it without changing the existing API / behavior in any
> significant ways. If I cannot, I will be (very, very) sad.
I'll definitely try to do this as best I can,
> (I don't mean to be overly critical of numpy.ma-- I just care about
> solving problems and making the tools as easy-to-use and intuitive as
> possible.)
> - Wes