[SciPy-user] Width of the gaussian in stats.kde.gaussian_kde ?
Anne Archibald
peridot.faceted@gmail....
Thu Sep 18 14:41:40 CDT 2008
2008/9/18 <torii@gmx.com>:
> I used the kernel-density estimate to make some 2D density plots
> (stats.kde.gaussian_kde) and I was very happy with the result.
>
> But when I do the same exercise over a much larger area, I completely lost
> the details I had with my previous analysis... If I understand correctly,
> this is related to the adaptation of the elementary gaussian to the scale of
> my dataset which now includes large areas with almost no data.
>
> Question: is there a way to control the width of the gaussian in
> stats.kde.gaussian_kde? or should I switch to another technique?
>
> Note: I am both a newbie in python and stats...
Kernel-density estimates approximate the distribution of your data by
placing a copy of the kernel at each data point. Scipy's gaussian_kde
uses multidimensional gaussians as the kernel. It provides various
features, including reasonably efficient evaluation, integration over
boxes and against gaussians and other gaussian KDEs, and most
relevantly, automatic selection of the covariance matrix of the
kernel. There are a number of different ways to do this automatic
selection in the statistical literature, and the one implemented in
gaussian_kde is appropriate for a unimodal distribution: it uses the
covariance matrix of your data, scaled by a factor depending on the
dimensionality and number of data points. If, however, you have a data
set consisting of several narrow widely-separated peaks, this will
give a needlessly blurry estimate.
A more powerful standard tool for choosing the variance in one
dimension is to use a "plug-in estimator". The idea is to try to
choose the variance that minimizes the estimated mean-squared error.
The mean squared error comes from two things: the randomness of the
original sampling, and the smoothing by the gaussians. If you choose a
very broad kernel, you smooth out almost all the noise, but you have
lots of mean-squared error because you've smoothed away any features
the distribution really had. On the other hand, if you use a very
narrow kernel, you haven't smoothed the distribution much at all, but
now the randomness of your points wrecks things. So you have to select
some optimal kernel. Ideally, you could run a numerical minimization
by evaluating the mean squared error for each trial variance.
Unfortunately this requires you to know the true distribution. That's
where the "plug-in estimator" comes in: you choose some crude way to
estimate the distribution, and use this crude estimate in your
mean-squared error calculations to get the best variance. You then use
this best variance as your kernel width.
What does this have to do with you? Well, unfortunately, plug-in
estimators are not implemented in scipy, probably in part because it's
difficult enough in one dimension, and a real horror in several. You
could try to implement it (after doing some reading in the statistical
literature) but I suspect that's not how you want to address the
problem.
I suggest you choose a "representative" part of your data that
consists of a single peak, feed it into gaussian_kde, and extract the
variance. Then create a gaussian_kde for the whole data set using that
same variance. You can optionally fudge the covariance as necessary in
between.
The first part, extracting the covariance from a gaussian_kde
instance, should be easy: as soon as you've constructed the
gaussian_kde it is stored in the covariance attribute. The second
part, building a gaussian_kde instance with set covariance, is going
to require a little hacking. Here's how I'd do it (untested):
class gaussian_kde_set_covariance(scipy.stats.gaussian_kde):
def __init__(self, dataset, covariance):
self.covariance = covariance
scipy.stats.gaussian_kde.__init__(self, dataset)
def _compute_covariance(self):
self.inv_cov = linalg.inv(self.covariance)
self._norm_factor = sqrt(linalg.det(2*pi*self.covariance)) * self.n
This creates a derived class from gaussian_kde which does everything
the same way except for how it is constructed and how it computes its
covariance matrix (i.e. it doesn't, but it does compute the various
other matrices it needs).
Good luck,
Anne
More information about the SciPy-user
mailing list