[Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Sat Jun 25 15:05:01 CDT 2011
On Sat, Jun 25, 2011 at 6:17 AM, Matthew Brett <firstname.lastname@example.org> wrote:
> On Sat, Jun 25, 2011 at 2:10 AM, Mark Wiebe <email@example.com> wrote:
> > On Fri, Jun 24, 2011 at 7:02 PM, Matthew Brett <firstname.lastname@example.org>
> > wrote:
> >> Hi,
> >> On Sat, Jun 25, 2011 at 12:22 AM, Wes McKinney <email@example.com>
> >> wrote:
> >> ...
> >> > Perhaps we should make a wiki page someplace summarizing pros and cons
> >> > of the various implementation approaches?
> >> But - we should do this if it really is an open question which one we
> >> go for. If not then, we're just slowing Mark down in getting to the
> >> implementation.
> >> Assuming the question is still open, here's a starter for the pros and
> >> cons:
> >> array.mask
> >> 1) It's easier / neater to implement
> > Yes
> >> 2) It can generalize across dtypes
> > Yes
> >> 3) You can still get the masked data underneath the mask (allowing you
> >> to unmask etc)
> > By setting up views appropriately, yes. If you don't have another view to
> > the underlying data, you can't get at it.
> >> nafloat64:
> >> 1) No memory overhead
> > Yes
> >> 2) Battle-tested implementation already done in R
> > We can't really use that though: R is GPL and NumPy is BSD. The
> > implementation details are likely different enough that a rewrite
> > would be needed anyway.
> Right - I wasn't suggesting using the code, only that the idea can be
> made to work coherently with an API that seems to have won friends
> over time.
OK, so I think you mean a battle-tested implementation of the interface R
exposes. That interface can be implemented with either masks or NA bit
patterns; I don't believe it has anything specific to bit patterns inherent
in it.
> >> I guess we'd have to test directly whether the non-contiguous memory
> >> of the mask and data would cause enough cache-miss problems to
> >> outweigh the potential cycle-savings from single byte comparisons in
> >> array.mask.
> > The different memory buffers are each contiguous, so the access patterns
> > still have a lot of coherency. I intend to give the mask memory layouts
> > matching those of the arrays.
> >> I guess that one and only one of these will get written. I guess that
> >> one of these choices may be a lot more satisfying to the current and
> >> future masked array itch than the other.
> > I'm only going to implement one solution, yes.
> >> I'm personally worried that the memory overhead of array.masks will
> >> make many of us tend to avoid them. I work with images that can
> >> easily get large enough that I would not want an array-items size byte
> >> array added to my storage.
> > May I ask what kind of dtypes and sizes you're working with?
> dtypes for images usually end up as floats - float32 or float64. On
> disk, and when memory mapped, they are often int16 or uint16. Sizes
> vary from fairly small 3D images of say 64 x 64 x 32 (1M in float64)
> to rather large 4D images - say 256 x 256 x 50 x 500 at the very high
> end (12.5G in float64).
OK, so the mask would be an extra 128KB or 1.6G, respectively.
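
As a rough check of those figures, here is a small sketch that assumes a mask costing one byte per array element. The helper names `data_nbytes` and `mask_nbytes` are made up for illustration and are not part of any NumPy API.

```python
from math import prod

def data_nbytes(shape, itemsize):
    """Bytes used by a dense array of the given shape and element size."""
    return prod(shape) * itemsize

def mask_nbytes(shape, mask_itemsize=1):
    """Bytes used by a mask with one entry per element (assumed 1 byte each)."""
    return prod(shape) * mask_itemsize

# Image shapes quoted in the thread; float64 elements are 8 bytes each.
shapes = {"small 3D": (64, 64, 32), "large 4D": (256, 256, 50, 500)}

for name, shape in shapes.items():
    data = data_nbytes(shape, itemsize=8)
    mask = mask_nbytes(shape)
    print(f"{name}: data {data:,} bytes, mask {mask:,} bytes "
          f"({mask / data:.1%} extra)")
```

Under that one-byte assumption the relative cost depends only on the dtype: 12.5% extra for float64, 25% for float32, and 50% for int16 or uint16.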
> >> The reason I'm asking for more details about the implementation is
> >> because that is most of the argument for array.mask at the moment (1
> >> and 2 above).
> > I'm first trying to nail down more of the higher level requirements
> > before digging really deep into the implementation details. They greatly
> > affect how those details have to turn out.
> Once you've started with the array.mask framework, you've committed
> yourself to the memory hit, and you may lose potential users who often
> hit memory limits. My guess is that no-one currently using np.ma is
> in that category, because it also uses a separate mask array, as I
> understand it.
In the same way, if I start with the NA bit pattern framework, I've
committed to throwing away the underlying values, and I will lose potential
users who want to keep them. This tradeoff goes both ways; it looks like
nobody would be completely satisfied with only one of the two approaches.
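
To make the tradeoff concrete, here is a minimal sketch. It uses the existing np.ma module as a stand-in for the mask approach and NaN as a stand-in for an NA bit pattern. Both stand-ins are assumptions for illustration only; neither is the design being proposed in this thread.

```python
import numpy as np

values = [1.0, 2.0, 3.0]

# Mask approach (np.ma as a stand-in): the element is hidden, not overwritten.
masked = np.ma.array(values, mask=[False, True, False])
masked_mean = masked.mean()          # computed over unmasked elements only
hidden_value = masked.data[1]        # the original value is still stored
mask_bytes = masked.mask.nbytes      # extra storage: one bool per element

# Bit-pattern approach (NaN as a stand-in for NA): no extra storage,
# but marking the element as missing overwrites its value.
na_array = np.array(values)
na_array[1] = np.nan
na_mean = np.nanmean(na_array)       # skips the NaN entry
value_lost = np.isnan(na_array[1])   # the original 2.0 cannot be recovered

print(masked_mean, hidden_value, mask_bytes)
print(na_mean, value_lost)
```

Both variants give the same mean over the valid elements. The difference is that the masked array pays for a separate mask buffer and keeps the hidden value, while the NaN version pays nothing extra and loses it.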
> See you,