[SciPy-User] Weighted KDE
Jackson Li
sonicatedboom-s@yahoo....
Sun Jan 13 10:08:50 CST 2013
<josef.pktd <at> gmail.com> writes:
>
> On Sun, May 13, 2012 at 1:07 PM, Zachary Pincus <zachary.pincus <at> yale.edu> wrote:
> > Hello all,
> >
> > A while ago, someone asked on this list about whether it would be simple
> > to modify scipy.stats.kde.gaussian_kde to deal with weighted data:
> > http://mail.scipy.org/pipermail/scipy-user/2008-November/018578.html
> >
> > Anne and Robert assured the writer that this was pretty simple (modulo
> > bandwidth selection), though I couldn't find any code that the original
> > author may have generated based on that advice.
> >
> > I've got a problem that could (perhaps) be solved neatly with weighted
> > KDE, so I'd like to give this a go. I assume that at a minimum, to get
> > basic gaussian_kde.evaluate() functionality:
> >
> > (1) The covariance calculation would need to be replaced by a
> > weighted-covariance calculation. (Simple enough.)
> >
> > (2) In evaluate(), the critical part looks like this (and a similar
> > stanza that loops over the points instead):
> >     # if there are more points than data, so loop over data
> >     for i in range(self.n):
> >         diff = self.dataset[:, i, newaxis] - points
> >         tdiff = dot(self.inv_cov, diff)
> >         energy = sum(diff*tdiff,axis=0) / 2.0
> >         result = result + exp(-energy)
> >
> > I assume that, further, the 'diff' values ought to be scaled by the
> > weights, too. Is this all that would need to be done? (For the
> > integration and resampling, obviously, there would be a bit of other
> > work...)
>
> It looks that way to me: scaled according to the weight of each dataset point.
>
> I don't see what the norm_factor should be:
> self._norm_factor = sqrt(linalg.det(2*pi*self.covariance)) * self.n
> there should be the weights somewhere in there, maybe just replace
> self.n by sum(weights) given a constant covariance
>
> sampling doesn't look difficult, if we want biased sampling, then
> instead of randint, we would need weighted randint (non-uniform)
>
> integration might require more work, or not (I never tried to understand them)
>
> (I don't know if kde in statsmodels has weights on the schedule.)
>
> Josef
> mostly guessing
>
> >
> > Thanks,
> > Zach
>
Hi,
I am facing the same problem as well, but can't figure out how the weighting
should be done exactly.
Has anybody successfully completed the modification of the code to allow a
weighted kde? I am attempting to perform kde on a set of imaging data with X, Y,
and an additional "temperature" column.
Performing the kde on only the X,Y axes gives a working heatmap showing the
spatial distribution of the data points, but I would also like to use them to
see the "temperature" profile (the third axis), much like a geographical heatmap
showing temperature or rainfall values over an X-Y map.
I found another set of code from
http://pastebin.com/LNdYCZgw
which allows weighted kde, but when I tried it out with my data, it took much
longer than the normal kde: more than an hour, whereas the original code took
only about twenty seconds (despite claims that it was faster).
Thanks,
Jackson
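
The suggestions in the quoted thread can be put together roughly as follows.
This is only a minimal sketch, not SciPy's code and not the pastebin code. It
assumes NumPy is available, and the function names are hypothetical. It also
makes three assumptions that the thread leaves open: the weights are
normalised to sum to 1 (so the "self.n" in the norm factor becomes
sum(weights) = 1), the weight multiplies each point's exp(-energy) term rather
than 'diff' itself, and the bandwidth uses Scott's rule with an effective
sample size of 1 / sum(w**2). Sampling uses a non-uniform index draw in place
of randint.

```python
import numpy as np


def _weighted_kernel_cov(dataset, w):
    """Weighted covariance of the data, scaled by a bandwidth factor.

    dataset has shape (d, n); w has shape (n,) and sums to 1.
    """
    d, _ = dataset.shape
    mean = dataset @ w
    centered = dataset - mean[:, np.newaxis]
    # Weighted covariance with the usual small-sample correction.
    cov = (centered * w) @ centered.T / (1.0 - np.sum(w ** 2))
    # Assumption: Scott's rule with an effective sample size.
    neff = 1.0 / np.sum(w ** 2)
    factor = neff ** (-1.0 / (d + 4))
    return cov * factor ** 2


def weighted_kde_evaluate(dataset, weights, points):
    """Evaluate a weighted Gaussian KDE at points of shape (d, m)."""
    dataset = np.atleast_2d(np.asarray(dataset, dtype=float))
    points = np.atleast_2d(np.asarray(points, dtype=float))
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so that sum(weights) == 1

    kernel_cov = _weighted_kernel_cov(dataset, w)
    inv_cov = np.linalg.inv(kernel_cov)
    # Same form as the quoted _norm_factor, with self.n replaced by
    # sum(weights), which is 1 after normalisation.
    norm_factor = np.sqrt(np.linalg.det(2 * np.pi * kernel_cov))

    result = np.zeros(points.shape[1])
    for i in range(dataset.shape[1]):  # loop over data, as in the quote
        diff = dataset[:, i, np.newaxis] - points
        tdiff = inv_cov @ diff
        energy = np.sum(diff * tdiff, axis=0) / 2.0
        result += w[i] * np.exp(-energy)  # weight the kernel term
    return result / norm_factor


def weighted_kde_resample(dataset, weights, size, seed=None):
    """Draw samples: pick data points by weight, then add kernel noise."""
    rng = np.random.default_rng(seed)
    dataset = np.atleast_2d(np.asarray(dataset, dtype=float))
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    d, n = dataset.shape
    kernel_cov = _weighted_kernel_cov(dataset, w)
    idx = rng.choice(n, size=size, p=w)  # the "weighted randint"
    noise = rng.multivariate_normal(np.zeros(d), kernel_cov, size=size).T
    return dataset[:, idx] + noise
```

For 2-D data, dataset would be a (2, n) array of X and Y values and points a
(2, m) array of grid positions. The Python loop runs once per data point, so
the cost grows with n times m.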
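
On the "temperature" map described in the message above: a KDE weighted by
temperature gives something like a density of temperature "mass", which is
high both where readings are hot and where there are simply many points. If
the goal is instead the local average temperature over the X-Y plane, one
common alternative is a kernel-weighted average (often called Nadaraya-Watson
regression), which divides the temperature-weighted kernel sum by the
unweighted kernel sum. This is a different technique from the one discussed
in the thread, shown here only as a sketch. The function name, the isotropic
Gaussian kernel, and the user-chosen bandwidth are all assumptions.

```python
import numpy as np


def kernel_weighted_mean(xy, values, grid_xy, bandwidth):
    """Local average of `values` at each grid position.

    xy:        (n, 2) array of data positions
    values:    (n,) array, e.g. the "temperature" column
    grid_xy:   (m, 2) array of positions to evaluate at
    bandwidth: kernel width, in the same units as xy (assumed isotropic)
    """
    xy = np.asarray(xy, dtype=float)
    values = np.asarray(values, dtype=float)
    grid_xy = np.asarray(grid_xy, dtype=float)

    # Squared distance from every grid position to every data point: (m, n).
    # This array can get large; split grid_xy into chunks for big grids.
    d2 = np.sum((grid_xy[:, None, :] - xy[None, :, :]) ** 2, axis=-1)
    kernel = np.exp(-0.5 * d2 / bandwidth ** 2)

    numerator = kernel @ values      # temperature-weighted kernel sum
    denominator = kernel.sum(axis=1)  # plain (unweighted) kernel sum

    # Far from all data the kernel sum can underflow to zero; return NaN there.
    out = np.full(grid_xy.shape[0], np.nan)
    np.divide(numerator, denominator, out=out, where=denominator > 0)
    return out
```

The whole grid is evaluated with array operations instead of a Python loop
over data points, which is usually where the time goes for large data sets.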