[Numpy-discussion] 2d binning and linear regression
Tue Jun 22 10:53:53 CDT 2010
On 06/22/2010 09:13 AM, Tom Durrant wrote:
> What exactly are trying to fit because it is rather bad practice
> to fit
> a model to some summarized data as you lose the uncertainty in the
> original data?
> If you define your boxes, you can loop through directly on each
> box and
> even fit the equation:
> model=mu +beta1*obs
> The extension is to fit the larger equation:
> model=mu + boxes + beta1*obs + beta2*obs*boxes
> where your boxes is a indicator or dummy variable for each box.
> Since you are only interested in the box by model term, you
> probably can
> use this type of model
> model=mu + boxes + beta2*obs*boxes
> However, these models assume that the residual variance is
> identical for
> all boxes. (That is solved by a mixed model approach.)
> I am trying to determine spatially based linear corrections for
> surface winds in order to force a wave model. The basic idea is, use
> satellite observations from sattelites to determine the errors and the
> wind, and reduce them by applying a linear correction prior to forcing
> the wave model.
> I am not sure I understand what you are saying, I am possibly trying
> to do what you are describing. i.e. for each box, gather
> observations, determine a linear correction, and apply it to the model
> model = a*x + b
> Does that make sense?
I used the data you gave and may have swapped 'model' and 'x' here - the
code should work if you switch these.
First I assume that you can create a variable 'box' based on the lat/lon
data - I just created one from the first values.
As I understand it, your problem is essentially is an analysis of
covariance with one factor (box) and one regressor (x). So I used some
technical knowledge about the parameterization of the normal equations
that may not be true in general for all models as it depends on finding
'equivalent models'. Just print out the different arrays for a small
example as it is rather hard to describe using words alone.
Basically you can create a design matrix that just includes dummy
variables for each box and the interactions between that dummy variable
and your 'x' array. This amounts to the equation:
y = box + box*x
y is your 'model' array
box is a 'dummy variable of box - number of columns is the number of boxes.
x is your 'x' array.
Note that there is no general or overall intercept here because you are
not interested in that. Rather you are interested in the 'intercept' for
each box which comes from the corresponding solution.
The code provides a function to create the dummy variables for each box
and a second function to compute the interactions between that dummy
variable and your 'x' array - these functions originated in
pystatsmodels. After that I form and solve the normal equations to get
the standard errors of the solutions and other useful statistics. Note
that the residual variance (MSE) is probably the pooled residual
variance across all boxes (I am too lazy to check).
Also, if you are not interested in the standard errors etc. then you
probably should use a more efficient solver available in numpy.
I (hopefully) reshaped the solutions ('beta') and standard errors
('pSE') so the rows are for each box, the first column is the
'intercept' and the second column is the 'slope'.
The output is:
$ python reg_box.py
Residual Sum of Squares 0.0306718851814
Residual Mean Sum of Squares 0.00511198086357
Estimated Intercept and regression for each box
[[ 0.69044944 0.44569288]
Estimated standard error of the Intercept and regression for each box
[[ 0.41081864 0.23061509]
[ 0.21123387 0.095004 ]]
a=0.45 (se=0.23) and b=0.69 (se= 0.41)
a=1.43 (se=0.095) and b=-1.53 (se=0.21)
(If you have questions, you can contact me off list if you want.)
-------------- next part --------------
An HTML attachment was scrubbed...
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 1754 bytes
Desc: not available
Url : http://mail.scipy.org/pipermail/numpy-discussion/attachments/20100622/20339b77/attachment.py
More information about the NumPy-Discussion