[Numpy-discussion] ANN: MaskedArray as a subclass of ndarray - followup

Pierre GM pgmdevlist at gmail.com
Fri Jan 19 16:28:51 CST 2007


Eric, Travis,
Thanks for the words of encouragement :)

I'm all in favor of having maskedarray ported to C, but I won't be able to do 
it myself anytime soon, and I would have to learn C beforehand. Francesc's 
suggestion of using Pyrex sounds nice; I'll try and see what I can do with 
that.

> Moving the implementation to the C-level would be awesome. In particular,
> __getitem__ and __setitem__ are incredibly slow with masked arrays compared
> to ndarrays, so using those inside python loops is basically a really bad
> idea currently. You always have to work with the _data and _mask attributes
> directly if you are concerned about performance.

Well, yeah, that's expected: __getitem__ tests whether the mask is defined 
(not nomask) before trying to access the item. If you're using it in a loop, 
you run the test on every iteration, which is a bad idea. It's indeed far 
better to run the test once beforehand, and process _data and _mask separately.
A fix would be to force the mask to be an array of booleans all the time, but 
that would slow things down elsewhere, as a lot of functions are artificially 
accelerated with the nomask trick. A C implementation may render that trick 
obsolete...
Another possibility would be to force the mask to be a bool array, and keep an 
extra flag on top, like hasmask. hasmask would be False by default, and set 
to True only if the mask contains at least one True. That'd require a 
mask.any() in __array_finalize__, which might still slow things down.
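
To make the "test once, then work on the data and the mask separately" point 
concrete, here is a rough sketch. It assumes the numpy.ma module as it ships 
with NumPy today (not the 2007 maskedarray package), 1-D input, and the public 
helpers ma.getdata / ma.getmask in place of the private _data and _mask 
attributes. The function names are made up for the illustration.

```python
import numpy.ma as ma


def masked_sum_slow(marr):
    """Sum the unmasked entries by indexing the masked array itself."""
    total = 0.0
    for i in range(marr.size):
        item = marr[i]              # goes through MaskedArray.__getitem__ every time
        if item is not ma.masked:   # masked entries come back as the ma.masked singleton
            total += item
    return total


def masked_sum_fast(marr):
    """Same result, but the nomask test is done once, outside the loop."""
    data = ma.getdata(marr)         # plain ndarray holding the values
    mask = ma.getmask(marr)         # ma.nomask if no mask was ever set
    total = 0.0
    if mask is ma.nomask:
        for x in data:              # no per-item mask lookup needed
            total += x
    else:
        for x, m in zip(data, mask):
            if not m:
                total += x
    return total


with_mask = ma.array([1.0, 2.0, 3.0, 4.0], mask=[False, True, False, False])
without_mask = ma.array([1.0, 2.0, 3.0])
```

Both functions should agree on any 1-D masked array; the second one simply 
avoids paying for the mask test on each item.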

> Also, there is a "bug" in Pierre's current implementation I spoke with him
> about, but currently have no solution for. numpy.add.accumulate doesn't
> work on arrays from the new maskedarray implementation, but does with the
> old one. 

The fact that it works with 'old' masked arrays doesn't count: they're not 
real ndarrays. They use the __array__ method to communicate with the rest of 
numpy, which we shouldn't need.

> The problem seems to arise when you over-ride __getitem__ in an 
> ndarray sub-class. See the code below for a demonstration:

I'm not sure that's actually the source of the problem.

ufuncs use the __array_wrap__ method to communicate with subclasses. ufunc 
methods (accumulate, for instance) seem to bypass that. In the meantime, the 
methods of the MA ufuncs work as expected.
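
One way to see which calls go through __array_wrap__ is to log them from a 
throwaway subclass. The sketch below is only a probe: the Tracked class is 
hypothetical, the signature follows current NumPy (the return_scalar argument 
did not exist in 2007), and what accumulate does depends on the NumPy version, 
so the result is printed rather than assumed.

```python
import numpy as np


class Tracked(np.ndarray):
    """Hypothetical ndarray subclass that records each __array_wrap__ call."""

    calls = []  # names of the ufuncs that reached __array_wrap__ (None if unknown)

    def __array_wrap__(self, out_arr, context=None, return_scalar=False):
        # context, when given, is a tuple whose first item is the calling ufunc
        name = context[0].__name__ if context is not None else None
        Tracked.calls.append(name)
        return np.asarray(out_arr).view(Tracked)


a = np.arange(5).view(Tracked)

plain = np.add(a, a)                 # ordinary ufunc call
calls_after_add = list(Tracked.calls)

acc = np.add.accumulate(a)           # ufunc method; behaviour is version dependent
print(type(acc).__name__, Tracked.calls)
```

Comparing Tracked.calls before and after the accumulate line shows whether the 
ufunc method went through __array_wrap__ on the installed version.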

Could somebody give me a simple explanation of the behaviour of ufunc 
methods on the Python side? I'm obviously missing something here...

