[Numpy-discussion] Re: ndarray.fill and ma.array.filled
Pierre GM
pgmdevlist at mailcan.com
Fri Apr 7 15:54:01 CDT 2006
Folks,
I'm more or less in Eric's field (hydrology), and we do have to deal with
missing values that we can't interpolate straightforwardly (that is, without
some dark statistical magic). Simply discarding the data is not an option
either. MA meets most of that need.
I think one of the issues is what is meant by 'masked data':
- a missing observation?
- a NaN?
- a data point we don't want to consider at one particular point?
For the last point, think about raster maps or bitmaps: calculations should be
performed on a chunk of data, the initial data left untouched, and the result
should both have the same size as the original and be valid only on the
initial chunk. The current MA implementation, with its _data part and its
_mask part, works nicely for the 3rd point.
- I wonder whether implementing a 'filled' method for ndarrays is really
better than letting the user create a MaskedArray where the NaNs are
masked. In any case, a 'filled' method should always return a copy, as the
result is no longer the initial data.
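
To make the MaskedArray option concrete, here is a minimal sketch. It assumes
today's numpy.ma module rather than the MA package discussed in this thread,
and the sample values and the -999.0 filler are made up for illustration.

```python
import numpy as np

# Hypothetical sample: two missing observations stored as NaN.
raw = np.array([1.0, np.nan, 3.0, np.nan])

# Mask the NaNs instead of overwriting them; `raw` itself is not modified.
masked = np.ma.masked_invalid(raw)

# With masked entries present, filled() returns a new plain ndarray.
filled = masked.filled(-999.0)
```

Note that, as far as I can tell, numpy.ma's filled() may hand back the
underlying data without copying when nothing is masked, so "always a copy" is
not something this sketch guarantees in general.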
- I'm not sure what to do with the idea of making ndarray a subclass of MA.
On one side, Tim rightly pointed out that an ndarray is just an MA with a
'False' mask. Actually, I'm a bit frustrated with the standard 'asarray' that
shows up in many functions. I'd prefer something like "if the argument is a
non-numpy sequence (tuple, list), transform it into an ndarray, but if it's
already an ndarray or an MA, leave it as it is. Don't touch the mask if
present". That's how MA.asarray works, but unfortunately the standard
'asarray' gets rid of the mask (and you end up with something which is not
what you'd expect). A 'mask=False' attribute in ndarray would be nice.
On the other, some methods/functions make sense only on unmasked ndarrays
(FFT, solving equations), and some others are a bit tricky to implement (diff?
median...). An exception could be raised if the arguments of these
functions return True with ismasked (cf below), or that could be simplified
if 'mask' were a default attribute of numarrays.
I regularly have to use an ismasked function:
    def ismasked(a):
        if hasattr(a, 'mask'):
            return a.mask.any()
        else:
            return False
We're going towards MA as the default object.
But then again, how should missing values be handled? Using R-like
na.actions? That'd be great, but it's getting more complex.
Oh, and another thing: if 'mask' or 'masked' becomes a default attribute of
ndarrays, how do we define a mask? As a boolean ndarray whose 'mask' is
always 'False'? How do you __repr__ it?
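
On the 'asarray' complaint above, a short sketch of the difference. It assumes
today's NumPy (numpy.asarray, numpy.asanyarray, numpy.ma.asarray) rather than
the packages of this thread, and the array values are made up.

```python
import numpy as np

# Hypothetical masked array with one masked entry.
m = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# np.asarray converts to a base ndarray, so the mask is lost.
plain = np.asarray(m)

# np.asanyarray lets ndarray subclasses pass through, mask included.
kept = np.asanyarray(m)

# np.ma.asarray turns a plain sequence into a MaskedArray.
from_list = np.ma.asarray([1.0, 2.0])
```

A function that wants the "leave it alone if it is already an array" behavior
could call asanyarray on its arguments instead of asarray.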
- I agree that 'filled_value' is not very useful. If I want to fill an array,
I'm happy to specify what value I want it filled with. In fact, I'd be
happier to specify 'values'. I often have to work with 2D arrays, each
column representing a different variable. If this array has to be filled, I'd
like each column to be filled with one particular value, not necessarily the
same for all columns: something like
column_stack([A[:,k].filled(filler[k]) for k in range(A.shape[1])])
with filler an array of A.shape[1] filling values, one per column. Of course,
we could imagine the same thing for rows, or higher dimensions...
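
The per-column fill above can be sketched without the explicit loop. This
assumes today's numpy.ma and relies on broadcasting in numpy.where; the array
A and the filler values are made-up examples, not data from this thread.

```python
import numpy as np

# Hypothetical 3x2 array: each column is a different variable.
A = np.ma.array(
    [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],
    mask=[[False, True], [True, False], [False, False]],
)
filler = np.array([-1.0, -99.0])  # one fill value per column

# Loop version, as in the one-liner above.
stacked = np.column_stack(
    [A[:, k].filled(filler[k]) for k in range(A.shape[1])]
)

# Broadcast version: filler (shape (2,)) is repeated down the rows and
# used wherever the mask is True; elsewhere the original data is kept.
filled = np.where(np.ma.getmaskarray(A), filler, A.data)
```

For a per-row fill, the same idea should work with filler reshaped to a
column, e.g. filler[:, None].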
Sorry for the rants...