[SciPy-user] predicting values based on (linear) models

Pierre GM pgmdevlist@gmail....
Thu Jan 15 12:05:04 CST 2009

On Jan 15, 2009, at 12:36 PM, Bruce Southey wrote:
> No! We have had considerable discussion on this aspect in the past on
> the numpy/scipy lists. Basically a missing observation should not be
> treated as a NaN (and there are different types of NaNs) because they
> are not the same. In some cases, missing values disappear in the
> calculations, such as creating the X'X matrix, but you probably do not
> want that if you have real NaNs in your data (say after taking the
> square root of an array that includes negative numbers).

numpy.ma implements equivalents of the ufuncs that return a masked
array in which invalid outputs are masked. An output is invalid if the
input is masked or if it falls outside the validity domain of the
function, so we're set. There are also functions that mask full rows or
columns of a 2D array, or that drop the rows/columns containing one or
more missing values, which can be used in some cases.
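
To make that concrete, here is a minimal sketch of the numpy.ma behaviour
described above. The sample values are made up for illustration, and it
assumes a reasonably recent numpy where ma.mask_rows and ma.compress_rows
are available.

```python
import numpy as np
import numpy.ma as ma

# The second entry is negative (outside sqrt's validity domain) and the
# last entry is already masked on input (a missing observation).
x = ma.array([4.0, -1.0, 9.0, 16.0], mask=[False, False, False, True])
root = ma.sqrt(x)  # both the invalid output and the masked input end up masked

# A 2D array with a single missing value in the second row.
data = ma.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                mask=[[False, False], [False, True], [False, False]])

# Mask every row that contains a masked entry (on a copy, since the
# function may modify the mask of its input).
masked_rows = ma.mask_rows(data.copy())

# Drop the rows that contain a masked entry; the result is a plain ndarray.
complete = ma.compress_rows(data)
```

Here root has its second and fourth entries masked, and complete keeps only
the first and third rows of data.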

> If you look at R's lm function you can see that you can fit a model
> using a formula. Without a similar framework, you can not do useful
> stats. Also you must have a 'mini model language' because the inputs
> must be created correctly and it gets very repetitive very quickly.

> For example, in R (and all major stats languages like SAS) you can
> just fit regression models like lm(Y ~ x2) and lm(Y ~ x3 + x1), where
> Y, x1, x2, and x3 are in the appropriate dataframe (not necessarily in
> that order).

Well, we could adapt the functions to accept a structured array as
input and define your x1, x2, ... from the fields of this array. I tried
to significantly improve the support of structured arrays in numpy.ma
1.3, so it shouldn't be that difficult to use masked arrays by default.
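
As a rough sketch of what I mean (not an existing function): fit_fields
below is a hypothetical helper, and the field names and sample values are
invented for illustration. It pulls the response and predictors out of a
masked structured array by field name, drops the records where any of
them is masked, and does an ordinary least-squares fit with
numpy.linalg.lstsq.

```python
import numpy as np
import numpy.ma as ma

def fit_fields(data, response, predictors):
    """Hypothetical helper: least-squares fit of `response` on `predictors`,
    all given as field names of a (masked) structured array."""
    y = ma.asarray(data[response], dtype=float)
    cols = [ma.asarray(data[name], dtype=float) for name in predictors]
    # Keep only the records where the response and every predictor are present.
    keep = ~ma.getmaskarray(y)
    for c in cols:
        keep &= ~ma.getmaskarray(c)
    # Design matrix: a column of ones for the intercept, then the predictors.
    X = np.column_stack([np.ones(keep.sum())] + [ma.getdata(c)[keep] for c in cols])
    coefs, _, _, _ = np.linalg.lstsq(X, ma.getdata(y)[keep], rcond=None)
    return coefs  # intercept first, then one slope per predictor

# Made-up data: Y = 1 + 2*x1 + 3*x3 on the first five records; the last
# record has a missing x3, so it is left out of any fit that uses x3.
dtype = [('Y', float), ('x1', float), ('x2', float), ('x3', float)]
raw = np.array([(4.0, 0.0, 5.0, 1.0), (3.0, 1.0, 4.0, 0.0),
                (11.0, 2.0, 3.0, 2.0), (10.0, 3.0, 2.0, 1.0),
                (18.0, 4.0, 1.0, 3.0), (99.0, 5.0, 0.0, 0.0)], dtype=dtype)
mask = [(False, False, False, False)] * 5 + [(False, False, False, True)]
data = ma.array(raw, mask=mask)

coefs_a = fit_fields(data, 'Y', ['x2'])        # roughly lm(Y ~ x2)
coefs_b = fit_fields(data, 'Y', ['x3', 'x1'])  # roughly lm(Y ~ x3 + x1)
```

No column is duplicated by hand here: both fits read straight from the
fields of the one array, which is about as close to the two lines of R as
we can get without a formula language.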

> If I understand mstats.linregress correctly, I have to create two
> arrays just to fit one of these two models. In the second case, I have
> to create yet another array. If I have my original data in one array,
> now I have unnecessarily duplicated 3 columns of that array not to
> mention had to do all this extra work, hopefully error free, just to
> do 2 lines of R code.

For the first case (Y ~ x2), you don't need 2 arrays: a single 2D array
with either 2 rows or 2 columns would work. mstats.linregress uses the
same approach as stats.linregress.
The second case is a tad more complex, but could probably be adapted
relatively easily.

> Jonathan's formula is along the right approach but, based on the doc
> string, rather cumbersome and does not use array inputs. It probably
> would be more effective with a record masked array.

OK, more on my todo list...
