[Numpy-discussion] ANN: maskedarray
Pierre GM
pgmdevlist@gmail....
Thu Sep 27 10:45:11 CDT 2007
All,
The latest version of maskedarray has just been released on the scipy SVN
sandbox. This version fixes the inconsistencies in filling (see below) and
introduces some minor modifications for optimization purposes (see below as
well). Many thanks to Eric Firing and Matt Knox for the fruitful discussions
at the origin of this release!
In addition, a bench.py file has been introduced, to compare the speed of
numpy.ma and maskedarray. Once again, thanks to Eric for his first draft.
Please feel free to try it and send me some feedback.
Modifications:
* Consistent filling!
In numpy.ma, the division of array A by array B works in several steps:
- A is filled w/ 0
- B is filled w/ 1
- A/B is computed
- the output mask is updated as the combination of A.mask, B.mask and the
domain mask (B==0)
The problems with this approach are that (i) there is no point in filling A and
B beforehand if the values will be masked anyway; (ii) nothing prevents infs
from showing up, as the domain is only taken into account at the very end.
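The fill-then-divide scheme can be sketched with plain numpy arrays (this is an illustration of the steps above, not the actual numpy.ma internals):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 2.0, 0.0])
a_mask = np.array([False, False, True])   # A has one masked value
b_mask = np.array([False, False, False])  # B is unmasked

# Steps 1-3: fill A with 0, B with 1, then divide.
# b[0] == 0 is *not* filled, so an inf leaks into the data even
# though that position will be masked afterwards.
with np.errstate(divide='ignore', invalid='ignore'):
    filled = np.where(a_mask, 0.0, a) / np.where(b_mask, 1.0, b)

# Step 4: the domain mask (B == 0) is only combined in at the end.
domain = (b == 0)
out_mask = a_mask | b_mask | domain
```

Here `filled[0]` is already inf by the time the domain mask is applied, which is exactly the second problem above.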
In this latest version of maskedarray, the same division is decomposed as:
- a copy of B._data is filled with 1 on the domain (B==0)
- the division of A._data by this copy is computed
- the output mask is updated as the combination of A.mask, B.mask and the
domain mask (B==0).
Prefilling on the domain avoids the presence of nans/infs. However, this comes
at the price of making some functions and methods slower than their numpy.ma
counterparts, as you will be able to observe for sqrt and log with the bench.py
file. An alternative would be to avoid filling at all, at the risk of leaving
nans and infs.
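The new decomposition can be sketched the same way with plain numpy arrays:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 2.0, 4.0])
a_mask = np.array([False, False, True])
b_mask = np.array([False, False, False])

# Prefill a copy of B's data with 1 on the domain (B == 0) *before*
# dividing, so no inf is ever produced.
domain = (b == 0)
b_safe = np.where(domain, 1.0, b)    # copy of B._data, 1 on the domain
result = a / b_safe                  # no divide-by-zero, no inf

# The output mask still combines A.mask, B.mask and the domain mask,
# so the prefilled positions end up masked anyway.
out_mask = a_mask | b_mask | domain
```

Every entry of `result` is finite; the extra `np.where` copy is the overhead mentioned above.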
* masked_invalid / fix_invalid
Two new functions are introduced.
masked_invalid(x) masks x where x is nan or inf.
fix_invalid(x) returns (a copy of) x, where invalid values (nans & infs) are
replaced by fill_value.
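Both functions survive in today's numpy.ma under the same names, so their intended behavior can be illustrated there:

```python
import numpy as np

x = np.array([1.0, np.inf, np.nan, 4.0])

# masked_invalid: nan/inf positions become masked
m = np.ma.masked_invalid(x)

# fix_invalid: returns a copy with invalid values replaced by
# fill_value (and masked)
f = np.ma.fix_invalid(x, fill_value=-999.0)
```

After this, `m.mask` is True exactly at the inf and nan positions, and `f.data` contains -999.0 in their place.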
* No mask shrinking
Following Paul Dubois and Sasha's example, I eventually had to get rid of the
semi-automatic shrinking of the mask in __getitem__, which appeared to be a
major bottleneck. In other words, one can end up with a mask that is an array
full of False instead of nomask, which may slow things down a bit. You can
force such a mask back to nomask with the new shrink_mask method.
* _sharedmask
Here again, I followed Paul and Sasha's ideas and reintroduced the _sharedmask
flag to prevent inadequate propagation of the mask. When creating a new array
with x=masked_array(data, mask=m), x._mask is initially a reference to m and
x._sharedmask is True. When x is modified, x._mask is copied to prevent a
propagation back to m.