[SciPy-user] Ols for np.arrays and masked arrays

josef.pktd@gmai... josef.pktd@gmai...
Fri Jan 16 18:53:48 CST 2009

Thanks for the explanations, it was quite a bit of trial and error to
find out, especially how the dimension handling and casting works.

my basic idea was:

* get a fast path through the function for (no nans, unmasked)
np.arrays, that's why I didn't convert inputs automatically to masked

* program basic statistical function for np.arrays without nans. I
would like to limit the handling of different types of arrays to the
input and output stages, so that the statistical core part does not
need to be special cased.

* use compressed not filled to convert masked data, because, in
general, there is no neutral fill value for regressions. It's also
easier to use existing functions, for example my version can use the
standard np.vander.

This works for calculating summary statistics, parameter estimates,
covariance matrices, test statistics and so on, but for transformation
of the input variables, error vector, predicted values, keeping masked
arrays might be necessary or more convenient.

I'm not yet very familiar with numpy details, for example when a view
and when a copy or when intermediate arrays are created and what the
performance overhead of casting back and forth is.

If we get a general setting for handling different type of arrays,
then this could be used to wrap standard statistical methods and
functions without too much extra work.

On Fri, Jan 16, 2009 at 5:13 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
> Josef,
> I'm rewriting your module, expect some update in the next few hours
> (days...).
> Still, I have some generic comments:
> * If you need a function that supports both ndarrays and masked
> arrays, force the inputs to be masked arrays, that'll be easier.
> * Use ma.fix_invalid to transform a ndarray w/ or w/o NaNs into a
> masked array: the NaNs will automatically be masked, and the
> underlying data fixed (a copy is made, no worry).
> * if you need to mask an element, just mask it directly: you don't
> have to set it to NaN and then use np.isnan for the mask. So, instead
> of:
> x_0 = x[:,0].copy()
> x_0[0] = np.nan
> x_0 = ma.masked_array(x_0, np.isnan(x_0))
> just do:
> x_0 = ma.array(x[:,0])
> x_0[0] = ma.masked

I followed the docs examples. In your way x_0.data still has the
original value (?), so I wouldn't have run into the problem with
numpy.testing asserts? Would this hide some test cases?

> * When mask is a boolean ndarray, just use x[~mask] instead of
> x[mask==False].

I didn't remember  `~`

> * To get rid of the missing data in x, use x.compressed() or emulate
> it with x.data[~ma.getmaskarray(x)]. ma.getmaskarray(x) always returns
> a ndarray with the same length as x, whereas ma.getmask(x) can return
> nomask.

this makes shape manipulation and shape preserving compression easier
it tried this
and got a masked array back, when I wanted this

> * when manipulating masked arrays, if performance is an issue,
> decompose the process in manipulating the data and the mask
> separately. The easiest is to use .filled to get a pure ndarray for
> the data. The choice of the fill_value depends on the application. In
> ma.polyfit, we fill y with 0, which doesn't really matter as the
> corresponding coefficients of x will be 0 (through vander).

compressed might be necessary, see above

>> Also this is the first time, that
>> I use masked arrays, and I'm not sure I found the best way.
> Don't worry, practice makes perfect.
>> I treat masked arrays or nans by removing all observations with masked
>> or nan values for the internal calculation, but keep the mask around,
>> for the case when it is needed for some output.
> You keep the *common* mask, which sounds OK. Removing the missing
> observations seems the way to go.

Actually, after the discussion for 3D picture filling, that it would
be possible to replace some of the missing values by their predicted
value or their conditional expectation in a second stage. I think this
would be the method specific "neutral" fill value.

>> I looked at
>> np.ma.polyfit, which uses a dummy fill value (0) before calling least
>> squares, but in general it will be difficult to find "neutral" fill
>> values..
> cf explanation above.
> _______________________________________________
> SciPy-user mailing list
> SciPy-user@scipy.org
> http://projects.scipy.org/mailman/listinfo/scipy-user


More information about the SciPy-user mailing list