[Numpy-discussion] 2d binning and linear regression

josef.pktd@gmai... josef.pktd@gmai...
Sun Jun 20 07:46:39 CDT 2010

```On Sun, Jun 20, 2010 at 4:24 AM, Tom Durrant <thdurrant@gmail.com> wrote:
> Hi All,
> I have a problem involving lat/lon data.  Basically, I am evaluating
> numerical weather model data against satellite data, and trying to produce
> gridded plots of various statistics.  There are various steps involved with
> this, but basically, I get to the point where I have four arrays of the same
> length, containing lat, lon, model and observed values respectively along
> the path of the sattelite.
> eg:
> lat       =  [   50.32   50.78    51.24    51.82   52.55    53.15    53.75
>  54.28   54.79   55.16  ... ]
> lon      =  [ 192.83  193.38  193.94  194.67  195.65  196.49  197.35  198.15
>  198.94 199.53  ... ]
> obs     =  [    1.42      1.35      1.55     1.50      1.59     1.76
> 2.15       1.90     1.55    0.73  ... ]
> model  = [     1.67      1.68     1.70      1.79     2.04     2.36      2.53
>      2.38      2.149   1.57 ... ]
> I then want to calculate statistics based on bins of say 2 X 2 degree boxes.
> These arrays are very large, on the order of 10^6. For ease of explanation,
> I will focus initially on bias.
> My first approach was to use loops through lat/lon boxes and use np.where
> statements to extract all the values of the model and observations within
> the given box, and calculate the bias as the mean of the difference.  This
> was obviously very slow.
> histogram2d provided a FAR better way to do this.  i.e.
>
> import numpy as np
> latedges=np.arange(-90,90,2)
> lonedges=np.arange(0,360,2)
> diff = model-obs
> grid_N, xedges, yedges = np.histogram2d(lat, lon],
>                 bins=(latedges,lonedges))
> grid_bias_sum, xedges, yedges = np.histogram2d(lat, lon, weights=diff,
>                 bins=(latedges,lonedges))
> grid_bias = grid_bias_sum/grid_N
>
> I now want to determine the the linear regression coefficients for each each
> box after fitting a least squares linear regression to the data in each bin.
>  I have been looking at using np.digitize to extract the bin indeces, but
> haven't had much success trying to do this in two dimensions.  I am back to
> looping through the lat and lon box values, using np.where to extract the
> observations and model data within that box, and using np.polyfit to obtain
> the regression coefficients.  This is, of course, very slow.
> Can anyone advise a smarter, vectorized way to do this?  Any thoughts would
> be greatly appreciated.

For a general linear regression problem, there wouldn't be much that I
can see that can be done.

If there are many small regression problem, then sometimes stacking
them into one big sparse least squares problem can be faster, it's
faster to solve but not always faster to create in the first place.

are you doing something like np.polyfit(model, obs, 1) ?

If you are using polyfit with deg=1, i.e. fitting a straight line,
then this could be also calculated using the weights in histogram2d.

histogram2d (histogramdd) uses np.digitize and np.bincount, so I'm
surprised if the histogram2d version is much faster. If a quick
reading of histogramdd is correct, the main improvement would be to
get the labels "xy" out of it, so it can be used repeatedly with
np.bincount.

Josef

> Tom
>
>
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
```