[SciPy-user] Ols for np.arrays and masked arrays

Pierre GM pgmdevlist@gmail....
Fri Jan 16 19:38:55 CST 2009


  Josef,
>
> * get a fast path through the function for (no nans, unmasked)
> np.arrays, that's why I didn't convert inputs automatically to masked
> arrays.
>
> * program basic statistical function for np.arrays without nans. I
> would like to limit the handling of different types of arrays to the
> input and output stages, so that the statistical core part does not
> need to be special cased.
>

Well, you can very well convert your inputs to MaskedArrays only (for
example through ma.fix_invalid), then get rid of the missing values so
that you work only w/ standard ndarrays. I'm
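
A minimal sketch of that input stage, assuming a row is dropped as soon as
y or any regressor in it is invalid. The helper name ols_input_stage and
the sample data are hypothetical; only ma.fix_invalid, ma.getmaskarray and
np.linalg.lstsq are actual numpy calls.

```python
import numpy as np
import numpy.ma as ma


def ols_input_stage(y, x):
    """Hypothetical helper: return plain ndarrays with invalid rows dropped."""
    # fix_invalid masks NaN/inf and keeps any mask already present.
    y = ma.fix_invalid(y)
    x = ma.fix_invalid(x)
    if x.ndim == 1:
        x = x[:, None]
    # Keep a row only if y and every regressor in that row are valid.
    keep = ~(ma.getmaskarray(y) | ma.getmaskarray(x).any(axis=1))
    return y.data[keep], x.data[keep]


# y carries a NaN, x carries an explicitly masked entry.
y = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
x = ma.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]],
             mask=[[0, 0], [0, 0], [0, 0], [0, 1], [0, 0]])

y_clean, x_clean = ols_input_stage(y, x)

# The statistical core only ever sees standard ndarrays.
beta, *_ = np.linalg.lstsq(x_clean, y_clean, rcond=None)
```

The core regression code then needs no special casing; only the input
stage knows about masks and NaNs.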


> * use compressed not filled to convert masked data, because, in
> general, there is no neutral fill value for regressions. It's also
> easier to use existing functions, for example my version can use the
> standard np.vander.

Indeed.
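
To illustrate the point about compressed vs. filled, here is a small sketch
with made-up data on an exact line y = 2x + 1. The fill value 0.0 is an
arbitrary choice for the demonstration, not a recommendation.

```python
import numpy as np
import numpy.ma as ma

# One regressor value is missing; the data under the mask is irrelevant.
x = ma.array([0.0, 1.0, 2.0, 3.0, 4.0], mask=[0, 0, 1, 0, 0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

good = ~ma.getmaskarray(x)

# compressed(): drop the masked point, then use the standard np.vander.
A_c = np.vander(x.compressed(), 2)          # columns: x, 1
coef_c, *_ = np.linalg.lstsq(A_c, y[good], rcond=None)

# filled(): the masked point is replaced by 0.0 and enters the fit
# as if it were a real observation, which biases the estimate.
A_f = np.vander(x.filled(0.0), 2)
coef_f, *_ = np.linalg.lstsq(A_f, y, rcond=None)
```

Here coef_c recovers the slope and intercept of the line, while coef_f
does not, because the filled value acts as a spurious data point.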




> I'm not yet very familiar with numpy details, for example when a view
> and when a copy or when intermediate arrays are created and what the
> performance overhead of casting back and forth is.

With a view, you don't create a new array, which is nice if you don't
intend to modify it. Creating a masked array version doesn't copy the
data either. An extra array is sometimes created (the mask), but it
can be modified relatively safely: modifications shouldn't be
propagated.
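
A short sketch of how one can check this with np.shares_memory. It relies on
ma.array defaulting to copy=False; the variable names are only for
illustration.

```python
import numpy as np
import numpy.ma as ma

a = np.arange(5.0)

# Basic slicing gives a view: no new data buffer is allocated.
v = a[1:4]

# ma.array defaults to copy=False, so the data buffer is shared too.
m = ma.array(a)

# Masking an element only touches the mask array, not the data.
m[0] = ma.masked

# Ask for an explicit copy when full isolation from `a` is wanted.
safe = ma.array(a, copy=True)

print(np.shares_memory(a, v))
print(np.shares_memory(a, m.data))
print(np.shares_memory(a, safe.data))
```

If you plan to assign new values (rather than just mask elements), passing
copy=True is the cautious choice.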


> If we get a general setting for handling different type of arrays,
> then this could be used to wrap standard statistical methods and
> functions without too much extra work.

That depends on the situation again. For regressions, your approach  
works. In other cases, the masked values have to be taken into account  
(because they should be counted as ties, for example). Using masked  
arrays should make it easier to adapt the code to other objects  
(TimeSeries, for example)


>> * if you need to mask an element, just mask it directly: you don't
>> have to set it to NaN and then use np.isnan for the mask. So, instead
>> of:
>> x_0 = x[:,0].copy()
>> x_0[0] = np.nan
>> x_0 = ma.masked_array(x_0, np.isnan(x_0))
>>
>> just do:
>> x_0 = ma.array(x[:,0])
>> x_0[0] = ma.masked
>
> I followed the docs examples. In your way x_0.data still has the
> original value (?), so I wouldn't have run into the problem with
> numpy.testing asserts? Would this hide some test cases?

I've never been happy with what was presented in the docs so far. Now
that a draft doc for numpy.ma is available, that should change.
In this example, yes, x_0.data[0] has the same value before and after
masking, but that's not a problem, as the mask will hide it (and
you'll drop it anyway later on). However, you should use
numpy.ma.testutils for testing.
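
A minimal sketch of such a test, with made-up data. As far as I understand
numpy.ma.testutils, its assert_equal combines the masks of both arguments
and does not compare the values hidden under them; the 99.0 below is a
deliberately different hidden value to show that.

```python
import numpy as np
import numpy.ma as ma
from numpy.ma.testutils import assert_equal

x = np.arange(6.0).reshape(3, 2)

# Mask the element directly instead of going through NaN.
x_0 = ma.array(x[:, 0])
x_0[0] = ma.masked

# The underlying data still holds the original value.
hidden = x_0.data[0]

# The expected array has a different value under the mask; the masked
# position is skipped by the comparison, so this passes.
expected = ma.array([99.0, 2.0, 4.0], mask=[1, 0, 0])
assert_equal(x_0, expected)
```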

>
>>
>> * To get rid of the missing data in x, use x.compressed() or emulate
>> it with x.data[~ma.getmaskarray(x)]. ma.getmaskarray(x) always returns
>> an ndarray with the same length as x, whereas ma.getmask(x) can return
>> nomask.
>
> this makes shape manipulation and shape preserving compression easier.
> I tried this
>     x_0[~ma.getmaskarray(x)]
> and got a masked array back, when I wanted this
>     x_0.data[~ma.getmaskarray(x)]

I saw that. .compressed flattens the data, which is an issue in your  
case. Just selecting elements of .data is more convenient.
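
The difference can be sketched on a small 2D example (made-up data; the
name keep is just illustrative):

```python
import numpy as np
import numpy.ma as ma

x = ma.array(np.arange(12.0).reshape(4, 3))
x[1, 2] = ma.masked

# compressed() always returns a flattened 1-D ndarray.
flat = x.compressed()

# Row-wise selection on .data keeps the column structure.
keep = ~ma.getmaskarray(x).any(axis=1)
rows = x.data[keep]

# Indexing the masked array itself still gives a masked array back.
still_masked = x[keep]

# getmask can return nomask, getmaskarray always returns a full array.
no_missing = ma.array([1.0, 2.0])
is_nomask = ma.getmask(no_missing) is ma.nomask
full_mask = ma.getmaskarray(no_missing)
```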

> Actually, after the discussion for 3D picture filling, I thought that it would
> be possible to replace some of the missing values by their predicted
> value or their conditional expectation in a second stage. I think this
> would be the method specific "neutral" fill value.

Except that it won't work, as .filled takes only one element (all the  
masked data are filled w/ the same value). What you wanna do is to use  
putmask on your standard ndarray.
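
A rough sketch of that second stage, assuming a straight-line fit on the
valid points supplies the predicted values. The data and the -999.0
placeholder under the mask are made up for the illustration.

```python
import numpy as np
import numpy.ma as ma

t = np.arange(6.0)
data = 3.0 * t + 2.0
data[[2, 4]] = -999.0                      # junk under the mask
y = ma.array(data, mask=[0, 0, 1, 0, 1, 0])

missing = ma.getmaskarray(y)

# First stage: fit on the valid observations only.
A = np.vander(t, 2)                        # columns: t, 1
coef, *_ = np.linalg.lstsq(A[~missing], y.data[~missing], rcond=None)

# Second stage: each missing entry gets its own predicted value.
predicted = A @ coef
filled = y.data.copy()                     # copy, so y itself is untouched
np.putmask(filled, missing, predicted)
```

Because predicted has the same shape as filled, putmask picks the matching
element for every masked position, which a single fill value cannot do.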






