[Numpy-discussion] 2d binning and linear regression
Tue Jun 22 14:58:29 CDT 2010
On Tue, Jun 22, 2010 at 10:09 AM, Tom Durrant <email@example.com> wrote:
>> the basic idea is in "polyfit on multiple data points" on
>> numpy-disscusion mailing list April 2009
>> In this case, calculations have to be done by groups
>> subtract mean (this needs to be replaced by group demeaning)
>> modeldm = model - model.mean()
>> obsdm = obs - obs.mean()
>> xx = np.histogram2d(
>> xx, xedges, yedges = np.histogram2d(lat, lon, weights=modeldm*modeldm,
>> xy, xedges, yedges = np.histogram2d(lat, lon, weights=obsdm*obsdm,
>> slopes = xy/xx # slopes by group
>> expand slopes to length of original array
>> predicted = model - obs * slopes_expanded
>> the main point is to get the group functions, for demeaning, ... for
>> the 2d labels (and get the labels out of histogramdd)
>> I'm out of time (off to the airport soon), but I can look into it next
> Thanks Josef, I will chase up the April list...
> If I understand what you have done above, this returns the slope of best fit
> lines forced through the origin, is that right?
Not if both variables, model and obs, are demeaned first, demeaning
removes any effect of a constant and only the slope is left over,
which can be done with the ration xx/xy.
But to get independent intercept per group, the demeaning has to be by group.
What's the size of your problem, how many groups or how many separate
demeaning by group has setup cost in this case, so the main speed
benefit would come if you calculate the digitize and label generation,
that histogram2d does, only ones and reuse it in later calculations.
Using dummy variables as Bruce proposes works very well if there are
not a very large number of groups, otherwise I think the memory
requirements and size of array would be very costly in terms of
> Have a great trip!
> NumPy-Discussion mailing list
More information about the NumPy-Discussion