[Numpy-discussion] 2d binning and linear regression

Bruce Southey bsouthey@gmail....
Tue Jun 22 10:53:53 CDT 2010

On 06/22/2010 09:13 AM, Tom Durrant wrote:
>     >
>     What exactly are trying to fit because it is rather bad practice
>     to fit
>     a model to some summarized data as you lose the uncertainty in the
>     original data?
>     If you define your boxes, you can loop through directly on each
>     box and
>     even fit the equation:
>     model=mu +beta1*obs
>     The extension is to fit the larger equation:
>     model=mu + boxes + beta1*obs + beta2*obs*boxes
>     where your boxes is a indicator or dummy variable for each box.
>     Since you are only interested in the box by model term, you
>     probably can
>     use this type of model
>     model=mu + boxes + beta2*obs*boxes
>     However, these models assume that the residual variance is
>     identical for
>     all boxes. (That is solved by a mixed model approach.)
>     Bruce
> Bruce,
> I am trying to determine spatially based linear corrections for 
> surface winds in order to force a wave model.  The basic idea is, use 
> satellite observations from sattelites to determine the errors and the 
> wind, and reduce them by applying a linear correction prior to forcing 
> the wave model.
> I am not sure I understand what you are saying, I am possibly trying 
> to do what you are describing.  i.e. for each box, gather 
> observations, determine a linear correction, and apply it to the model
> model = a*x + b
> Does that make sense?
> Cheers
> Tom

I used the data you gave and may have swapped 'model' and 'x' here - the 
code should work if you switch these.
First I assume that you can create a variable 'box' based on the lat/lon 
data - I just created one from the first values.

As I understand it, your problem is essentially is an analysis of 
covariance with one factor (box) and one regressor (x). So I used some 
technical knowledge about the parameterization of the normal equations 
that may not be true in general for all models as it depends on finding 
'equivalent models'.  Just print out the different arrays for a small 
example as it is rather hard to describe using words alone.

Basically you can create a design matrix that just includes dummy 
variables for each box and the interactions between that dummy variable 
and your 'x' array. This amounts to the equation:

y = box + box*x

y is your 'model' array
box is a 'dummy variable of box - number of columns is the number of boxes.
x is your 'x' array.

Note that there is no general or overall intercept here because you are 
not interested in that. Rather you are interested in the 'intercept' for 
each box which comes from the corresponding solution.

The code provides a function to create the dummy variables for each box 
and a second function to compute the interactions between that dummy 
variable and your 'x' array - these functions originated in 
pystatsmodels. After that I form and solve the normal equations to get 
the standard errors of the solutions and other useful statistics. Note 
that the residual variance (MSE) is probably the pooled residual 
variance across all boxes (I am too lazy to check).

Also, if you are not interested in the standard errors etc. then you 
probably should use a more efficient solver available in numpy.

I (hopefully) reshaped the solutions ('beta') and standard errors 
('pSE') so the rows are for each box, the first column is the 
'intercept' and the second column is the 'slope'.

The output is:

$ python reg_box.py
Residual Sum of Squares      0.0306718851814
Residual Mean Sum of Squares 0.00511198086357
RSquared                     0.975657233983
Estimated Intercept and regression for each box
[[ 0.69044944  0.44569288]
  [-1.53272813  1.43358273]]
Estimated standard error of the Intercept and regression for each box
[[ 0.41081864  0.23061509]
  [ 0.21123387  0.095004  ]]

For box=1:
a=0.45 (se=0.23) and b=0.69 (se= 0.41)

For box=2:
a=1.43 (se=0.095) and b=-1.53 (se=0.21)

(If you have questions, you can contact me off list if you want.)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/numpy-discussion/attachments/20100622/20339b77/attachment.html 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: reg_box.py
Type: text/x-python
Size: 1754 bytes
Desc: not available
Url : http://mail.scipy.org/pipermail/numpy-discussion/attachments/20100622/20339b77/attachment.py 

More information about the NumPy-Discussion mailing list