[Numpy-discussion] A crazy masked-array thought
Fri Apr 27 11:42:55 CDT 2012
On Apr 25, 2012, at 10:58 AM, Richard Hattersley wrote:
> The masked array discussions have brought up all sorts of interesting topics - too many to usefully list here - but there's one aspect I haven't spotted yet. Perhaps that's because it's flat out wrong, or crazy, or just too awkward to be helpful. But ...
> Shouldn't masked arrays (MA) be a superclass of the plain-old-array (POA)?
Ultimately, this is what Chuck and Mark are advocating, I believe. It's not a crazy idea. In fact, it's probably more correct in that masked arrays *are* more general than POAs. If we were starting from scratch in 1994 (Numeric days), I could see taking this route and setting expectations correctly for downstream libraries.
There are three problems I see with jamming this concept into NumPy 1.X, however, by modifying all POA data-structures to now *be* masked arrays.
1) There is a lot of code out there that does not know anything about masks and is not used to checking for masks. It enlarges the basic abstraction in a way that is not backwards compatible *conceptually*. This smells fishy to me and I could see a lot of downstream problems from libraries that rely on NumPy.
2) We cannot agree on how masks should be handled and consequently don't have a real plan for migrating numpy.ma to use these masks. So, we are just growing the API and introducing uncertainty for unclear benefit, especially for the person who does not want to use masks.
3) Subclassing at the C level in Python requires that the C structures be *binary* compatible. This implies that a subclass can only have *more* attributes than its superclass, never fewer. The way it is currently implemented, that means that POAs would carry extra pointers they don't need just to satisfy that requirement. From a C-struct perspective it therefore makes more sense for MAs to inherit from POAs. Ideally, that shouldn't drive the design, but it's part of the landscape in NumPy 1.X.
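The binary-compatibility constraint in point 3 can be illustrated with a minimal sketch using Python's ctypes module. The struct names and fields below (PlainArrayStruct, MaskedArrayStruct, data, ndim, mask) are hypothetical stand-ins, not NumPy's actual C structures; the point is only that a derived layout must begin with the base layout and can add fields only at the end.

```python
import ctypes

# Hypothetical stand-in for a plain-array C struct (not NumPy's real layout).
class PlainArrayStruct(ctypes.Structure):
    _fields_ = [
        ("data", ctypes.c_void_p),  # pointer to the element buffer
        ("ndim", ctypes.c_int),     # number of dimensions
    ]

# A "subclass" struct inherits the base fields and appends its own after them.
class MaskedArrayStruct(PlainArrayStruct):
    _fields_ = [
        ("mask", ctypes.c_void_p),  # extra pointer that only masked arrays need
    ]

# Inherited fields sit at the same offsets in both layouts, so code written
# against the base struct can read a derived struct without knowing about it.
same_prefix = (
    MaskedArrayStruct.data.offset == PlainArrayStruct.data.offset
    and MaskedArrayStruct.ndim.offset == PlainArrayStruct.ndim.offset
)

# The new field lives past the end of the base layout, so the derived struct
# is strictly larger than the base struct.
mask_after_base = MaskedArrayStruct.mask.offset >= ctypes.sizeof(PlainArrayStruct)
derived_is_larger = ctypes.sizeof(MaskedArrayStruct) > ctypes.sizeof(PlainArrayStruct)
```

Under this layout rule, making the masked array the base type would force every plain array to carry the mask pointer as well, which is the overhead described above.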
I have some ideas about how to move forward, but I'm anxiously awaiting the write-up that Mark and Nathaniel are working on to inform and enhance those ideas.
Masked arrays do have a long history in the Numeric and NumPy code base. Paul Dubois originally created the first masked arrays in Numeric and helped move them to numpy.ma. Pierre GM took that code and worked very hard to add a lot of features. I'm very concerned about adding a new masked array abstraction into the *core* of all NumPy arrays, especially one that is not well informed by this history or by its user base.
I was just visiting LLNL a couple of weeks ago and realized that they are using masked arrays very heavily in UV-CDAT and elsewhere. I've also seen many other people in industry, academia, and government use masked arrays. I've typically squirmed at that because I know that masked arrays have performance issues, since they are implemented in Python. I've also wondered about masked arrays as *subclasses* of POAs because of how much code has to be rewritten in the subclass for it to work correctly.
So, in summary, my view is that NumPy has masked arrays already (and has had them for a long, long time). Missing data is only one of the use cases for masked arrays (though it is probably the dominant use case for numpy.ma). Independent of the "missing-data" story, any plan to add masks directly to a base object in NumPy needs to take into account the numpy.ma user base and the POA user base that does not expect to be dealing with masks.
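As a small illustration of what "does not expect to be dealing with masks" can mean in practice, here is a sketch using the existing numpy.ma module. The two helper functions (mean_poa, mean_mask_aware) and the sample values are hypothetical, written only for this example; np.asarray, np.asanyarray, and numpy.ma.masked_array are the real NumPy calls.

```python
import numpy as np
import numpy.ma as ma

def mean_poa(values):
    """Typical mask-unaware code: coerce the input to a plain ndarray."""
    arr = np.asarray(values)      # drops the mask; masked slots become ordinary data
    return arr.sum() / arr.size

def mean_mask_aware(values):
    """The same computation, written to let array subclasses pass through."""
    arr = np.asanyarray(values)   # keeps a MaskedArray as a MaskedArray
    return arr.mean()             # MaskedArray.mean skips masked entries

# 999.0 is a placeholder value hidden behind the mask.
data = ma.masked_array([1.0, 2.0, 999.0], mask=[False, False, True])

unaware = mean_poa(data)          # the placeholder is averaged in
aware = mean_mask_aware(data)     # only the two unmasked values are averaged
```

The two results differ only because the first function was never written with masks in mind, which is the kind of downstream code any design that adds masks to the base array has to account for.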
That doesn't mean it needs to follow numpy.ma design choices and API. It does, however, need to think about how a typical numpy.ma user could instead use the new masked array concept, and how numpy.ma itself could be revised to use the new masked array concept.
I think Mark has done some amazing coding and I would like to keep as much of it as possible available to people. We may need to adjust *how* it is presented downstream, but I'm hopeful that we can do that.
Thanks for your ideas and your comments.
> In the library I'm working on, the introduction of MAs (via numpy.ma) required us to sweep through the library and make a fair few changes. That's not the sort of thing one would normally expect from the introduction of a subclass.
> Putting aside the ABI issue, would it help downstream API compatibility if the POA was a subclass of the MA? Code that's expecting/casting-to a POA might continue to work and, where appropriate, could be upgraded in their own time to accept MAs.
> Richard Hattersley