[Numpy-discussion] 2d binning and linear regression
Mon Jun 21 09:38:08 CDT 2010
On 06/20/2010 03:24 AM, Tom Durrant wrote:
> Hi All,
> I have a problem involving lat/lon data. Basically, I am evaluating
> numerical weather model data against satellite data, and trying to
> produce gridded plots of various statistics. There are various steps
> involved with this, but basically, I get to the point where I have
> four arrays of the same length, containing lat, lon, model and
> observed values respectively along the path of the sattelite.
> lat = [ 50.32 50.78 51.24 51.82 52.55 53.15
> 53.75 54.28 54.79 55.16 ... ]
> lon = [ 192.83 193.38 193.94 194.67 195.65 196.49 197.35
> 198.15 198.94 199.53 ... ]
> obs = [ 1.42 1.35 1.55 1.50 1.59 1.76
> 2.15 1.90 1.55 0.73 ... ]
> model = [ 1.67 1.68 1.70 1.79 2.04 2.36
> 2.53 2.38 2.149 1.57 ... ]
> I then want to calculate statistics based on bins of say 2 X 2 degree
> boxes. These arrays are very large, on the order of 10^6. For ease of
> explanation, I will focus initially on bias.
> My first approach was to use loops through lat/lon boxes and use
> np.where statements to extract all the values of the model and
> observations within the given box, and calculate the bias as the mean
> of the difference. This was obviously very slow.
> histogram2d provided a FAR better way to do this. i.e.
> import numpy as np
> diff = model-obs
> grid_N, xedges, yedges = np.histogram2d(lat, lon],
> grid_bias_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
> grid_bias = grid_bias_sum/grid_N
> I now want to determine the the linear regression coefficients for
> each each box after fitting a least squares linear regression to the
> data in each bin. I have been looking at using np.digitize to extract
> the bin indeces, but haven't had much success trying to do this in two
> dimensions. I am back to looping through the lat and lon box values,
> using np.where to extract the observations and model data within that
> box, and using np.polyfit to obtain the regression coefficients. This
> is, of course, very slow.
> Can anyone advise a smarter, vectorized way to do this? Any thoughts
> would be greatly appreciated.
> Thanks in advance
What exactly are trying to fit because it is rather bad practice to fit
a model to some summarized data as you lose the uncertainty in the
If you define your boxes, you can loop through directly on each box and
even fit the equation:
The extension is to fit the larger equation:
model=mu + boxes + beta1*obs + beta2*obs*boxes
where your boxes is a indicator or dummy variable for each box.
Since you are only interested in the box by model term, you probably can
use this type of model
model=mu + boxes + beta2*obs*boxes
However, these models assume that the residual variance is identical for
all boxes. (That is solved by a mixed model approach.)
More information about the NumPy-Discussion