# [Numpy-discussion] 2d binning and linear regression

Tom Durrant thdurrant@gmail....
Sun Jun 20 03:24:42 CDT 2010

```Hi All,

I have a problem involving lat/lon data.  Basically, I am evaluating
numerical weather model data against satellite data, and trying to produce
gridded plots of various statistics.  There are various steps involved with
this, but basically, I get to the point where I have four arrays of the same
length, containing lat, lon, model and observed values respectively along
the path of the sattelite.

eg:

lat       =  [   50.32   50.78    51.24    51.82   52.55    53.15    53.75
54.28   54.79   55.16  ... ]
lon      =  [ 192.83  193.38  193.94  194.67  195.65  196.49  197.35  198.15
198.94 199.53  ... ]
obs     =  [    1.42      1.35      1.55     1.50      1.59     1.76
2.15       1.90     1.55    0.73  ... ]
model  = [     1.67      1.68     1.70      1.79     2.04     2.36      2.53
2.38      2.149   1.57 ... ]

I then want to calculate statistics based on bins of say 2 X 2 degree boxes.
These arrays are very large, on the order of 10^6. For ease of explanation,
I will focus initially on bias.

My first approach was to use loops through lat/lon boxes and use np.where
statements to extract all the values of the model and observations within
the given box, and calculate the bias as the mean of the difference.  This
was obviously very slow.

histogram2d provided a FAR better way to do this.  i.e.

import numpy as np

latedges=np.arange(-90,90,2)
lonedges=np.arange(0,360,2)

diff = model-obs
grid_N, xedges, yedges = np.histogram2d(lat, lon],
bins=(latedges,lonedges))
grid_bias_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
bins=(latedges,lonedges))
grid_bias = grid_bias_sum/grid_N

I now want to determine the the linear regression coefficients for each each
box after fitting a least squares linear regression to the data in each bin.
I have been looking at using np.digitize to extract the bin indeces, but
haven't had much success trying to do this in two dimensions.  I am back to
looping through the lat and lon box values, using np.where to extract the
observations and model data within that box, and using np.polyfit to obtain
the regression coefficients.  This is, of course, very slow.

Can anyone advise a smarter, vectorized way to do this?  Any thoughts would
be greatly appreciated.