[SciPy-User] Weighted KDE

Jackson Li sonicatedboom-s@yahoo....
Sun Jan 13 10:08:50 CST 2013


 <josef.pktd <at> gmail.com> writes:

> 
> On Sun, May 13, 2012 at 1:07 PM, Zachary Pincus <zachary.pincus <at> yale.edu> 
wrote:
> > Hello all,
> >
> > A while ago, someone asked on this list about whether it would be simple to 
modify
> scipy.stats.kde.gaussian_kde to deal with weighted data:
> > http://mail.scipy.org/pipermail/scipy-user/2008-November/018578.html
> >
> > Anne and Robert assured the writer that this was pretty simple (modulo 
bandwidth selection), though I
> couldn't find any code that the original author may have generated based on 
that advice.
> >
> > I've got a problem that could (perhaps) be solved neatly with weighed KDE, 
so I'd like to give this a go. I
> assume that at a minimum, to get basic gaussian_kde.evaluate() functionality:
> >
> > (1) The covariance calculation would need to be replaced by a weighted-
covariance calculation. (Simple enough.)
> >
> > (2) In evaluate(), the critical part looks like this (and a similar stanza 
that loops over the points instead):
> > # if there are more points than data, so loop over data
> > for i in range(self.n):
> >    diff = self.dataset[:, i, newaxis] - points
> >    tdiff = dot(self.inv_cov, diff)
> >    energy = sum(diff*tdiff,axis=0) / 2.0
> >    result = result + exp(-energy)
> >
> > I assume that, further, the 'diff' values ought to be scaled by the weights, 
too. Is this all that would need
> to be done? (For the integration and resampling, obviously, there would be a 
bit of other work...)
> 
> it looks to me that way, scaled according to weight by dataset points
> 
> I don't see what the norm_factor should be:
>       self._norm_factor = sqrt(linalg.det(2*pi*self.covariance)) * self.n
> there should be the weights somewhere in there, maybe just replace
> self.n by sum(weights) given a constant covariance
> 
> sampling doesn't look difficult, if we want biased sampling, then
> instead of randint, we would need weighted randint (non-uniform)
> 
> integration might require more work, or not (I never tried to understand them)
> 
> (I don't know if kde in statsmodels has weights on the schedule.)
> 
> Josef
> mostly guessing
> 
> >
> > Thanks,
> > Zach
> > _______________________________________________
> > SciPy-User mailing list
> > SciPy-User <at> scipy.org
> > http://mail.scipy.org/mailman/listinfo/scipy-user
> 

Hi,

I am facing the same problem as well, but can't figure out how the weighting 
should be done exactly.

Has anybody successfully completed the modification of the code to allow a 
weighted kde? I am attempting to perform kde on a set of imaging data with X, Y, 
and an additional "temperature" column.

Performing the kde on only the X,Y axes gives a working heatmap showing the 
spatial distribution of the data points, but I would also like to use them to 
see the "temperature" profile (the third axis), much like a geographical heatmap 
showing temperature or rainfall values over a X-Y map.

I found another set of code from 
http://pastebin.com/LNdYCZgw
which allows weighted kde, but when I tried it out with my data, it took much 
longer than the normal kde (>1 hour) when the original code took only a about 
twenty seconds (despite claims that it was faster). 

Thanks,
Jackson







More information about the SciPy-User mailing list