[SciPy-Dev] scipy.stats.kde

Anne Archibald aarchiba@physics.mcgill...
Fri Aug 27 14:51:50 CDT 2010


On 27 August 2010 15:27, Sam Birch <sam.m.birch@gmail.com> wrote:
>> Bandwidth selection is a hotly debated topic, at least in one
>> dimension, so perhaps not just different methods but tools for
>> diagnosing bandwidth selection problems would be nice - at the least,
>> it should be made straightforward to vary the bandwidth (e.g. to plot
>> the KDE with a range of different bandwidth values).
>
> Well by allowing them to use a custom bandwidth matrix they can vary it
> themselves, no?

Well, in principle, yes. But if the API forces them to construct an
entirely new KDE object to change the bandwidth matrix, and if this
object involves substantial additional data structures (e.g. a kd-tree
holding the data points), this could be cumbersome.
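Roughly what I have in mind is something like the following (a hypothetical sketch, not scipy.stats.kde's actual API - the class name, the `cutoff` parameter, and the use of scipy.spatial's cKDTree are all my own inventions here). The expensive, data-dependent setup happens once; the bandwidth is a plain attribute you can reassign between plots:

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical sketch: separate the expensive setup (the kd-tree) from
# the cheap-to-change bandwidth, so users can try several bandwidths
# without rebuilding the whole KDE object.
class GaussianKDE:
    def __init__(self, points, bandwidth):
        self.points = np.atleast_2d(points)   # shape (n, d)
        self.tree = cKDTree(self.points)      # built once, reused
        self.bandwidth = bandwidth            # cheap to reassign

    def evaluate(self, x, cutoff=5.0):
        x = np.atleast_2d(x)
        h, d = self.bandwidth, self.points.shape[1]
        norm = (2 * np.pi) ** (d / 2) * h ** d * len(self.points)
        out = np.empty(len(x))
        for i, xi in enumerate(x):
            # only neighbours within `cutoff` sigmas contribute
            idx = self.tree.query_ball_point(xi, cutoff * h)
            r2 = np.sum((self.points[idx] - xi) ** 2, axis=1)
            out[i] = np.exp(-r2 / (2 * h * h)).sum() / norm
        return out
```

Then plotting the KDE with a range of bandwidths is just a loop that reassigns `kde.bandwidth` and calls `evaluate` again.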

>> At the other end of the spectrum, for very dense KDEs, on the circle I
>> found it extremely convenient to use Fourier transforms to carry out
>> the convolution of kernel with points. In particular, I represented
>> the KDE in terms of its Fourier coefficients, so that an inverse FFT
>> immediately gave me the KDE evaluated on a grid (or, with some
>> fiddling, integrated over the bins of a histogram). I don't know
>> whether this is a useful optimization for KDEs on the line or in
>> higher dimensions, since there's the problem of wrapping.
>
> That sounds very interesting. Sorry if I'm being dense (or just wrong, or
> both), but do you convolve post-FFT or before? If before why does it make it
> easier?

Again, this is for work on the circle and for fairly dense data sets.
But in principle, the KDE as a function is the convolution of a forest
of delta functions, one per point, with the kernel. The conventional
way to evaluate this function at a point is simply to evaluate the
kernel once per data point and add them up. To evaluate this on a
grid, you evaluate the kernel once per grid point per data point and
add. Naturally this can be expensive.
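Concretely, the brute-force evaluation looks something like this (a hypothetical Python sketch; the [0, 1) phase convention and the von Mises kernel are just my choices for the circular case):

```python
import numpy as np
from scipy.special import iv

# Brute-force KDE on the circle: evaluate the kernel once per data
# point per grid point and sum -- O(n_grid * n_points) kernel calls.
def kde_direct(phases, grid, kernel):
    d = grid[:, None] - phases[None, :]   # pairwise phase differences
    return kernel(d).mean(axis=1)

kappa = 5.0
# von Mises kernel, normalized to integrate to 1 over one period
vm = lambda p: np.exp(kappa * np.cos(2 * np.pi * p)) / iv(0, kappa)
phases = np.array([0.1, 0.12, 0.5, 0.53, 0.9])
grid = np.linspace(0, 1, 200, endpoint=False)
density = kde_direct(phases, grid, vm)
```

The cost grows as the product of the grid size and the number of data points, which is exactly what the Fourier representation below avoids.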

My idea is that convolution of functions is simply multiplication of
the Fourier transforms of those functions. So instead of storing a
list of data points in my KDE object, I store a representation of the
forest of delta functions in terms of their Fourier coefficients (the
nth Fourier coefficient of a delta function at phase p is exp(2 pi i n
p)). This is necessarily approximate, since I store finitely many
Fourier coefficients, but it's not hard to store "enough". Now when I
want to convolve this forest by a kernel, I simply multiply these
Fourier coefficients by those of the kernel. The easiest "kernel" is
the sinc function, for which I simply truncate the Fourier
coefficients (which is why it's easy to have enough). We actually use
this "kernel" a lot, even though it's not positive everywhere. A
mathematically-better choice is the von Mises distribution, whose PDF
is proportional to exp(k cos(x)) and whose Fourier coefficients can be
written in terms of Bessel functions.
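In Python this might look roughly like the following (a sketch under my own conventions: phases live in [0, 1), and only the coefficients n = 0..M are stored since the density is real; the function names are made up):

```python
import numpy as np
from scipy.special import iv

# Fourier coefficients c_n, n = 0..M, of the normalized forest of
# delta functions (one delta per phase, phases in [0, 1)); since the
# density is real, c_{-n} = conj(c_n) and need not be stored.
def forest_fc(phases, M):
    n = np.arange(M + 1)
    return np.exp(-2j * np.pi * np.outer(n, phases)).mean(axis=1)

# Fourier coefficients of the von Mises kernel, whose PDF is
# proportional to exp(kappa * cos(2 * pi * p)): ratios of modified
# Bessel functions of the first kind, I_n(kappa) / I_0(kappa).
def von_mises_fc(kappa, M):
    n = np.arange(M + 1)
    return iv(n, kappa) / iv(0, kappa)

phases = np.array([0.1, 0.12, 0.5, 0.53, 0.9])
c = forest_fc(phases, 32)
smoothed = c * von_mises_fc(5.0, 32)  # convolution = multiplication
# The sinc "kernel" is just truncation: keep c[:K+1] and drop the rest.
```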

Once you have the Fourier coefficients of the KDE, you can evaluate it
at a point by taking a sum of sinusoids, but the key idea is that you
can evaluate it on a grid by taking an inverse FFT. If you want
integrals over intervals, you get those by integrating the
sinusoids over the intervals, so there's a messy but easily derived way to
work out the area in terms of the Fourier coefficients. This too can
be nicely worked out on a grid, by fiddling the Fourier coefficients
and taking an inverse FFT.
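A sketch of both steps, under the same conventions as above (phases in [0, 1), coefficients stored for n = 0..M with c_{-n} the conjugate of c_n); the zero-pad-then-irfft trick and the helper names are my own:

```python
import numpy as np

# Evaluate a circular density, given by its Fourier coefficients c_n
# for n = 0..M, on G equally spaced grid points with a single inverse
# real FFT: zero-pad the coefficients to G//2 + 1 and scale by G.
def kde_on_grid(c, G):
    X = np.zeros(G // 2 + 1, dtype=complex)
    X[: len(c)] = np.asarray(c) * G
    return np.fft.irfft(X, n=G)

# Integral of the density over [a, b]: integrate each sinusoid
# analytically.  The n = 0 term contributes c_0 * (b - a); each
# pair (n, -n) contributes twice the real part of the n > 0 term.
def kde_interval(c, a, b):
    c = np.asarray(c, dtype=complex)
    n = np.arange(1, len(c))
    terms = (np.exp(2j * np.pi * n * b)
             - np.exp(2j * np.pi * n * a)) / (2j * np.pi * n)
    return float(np.real(c[0] * (b - a) + 2 * np.sum(c[1:] * terms)))
```

As a sanity check, `kde_interval(c, 0, 1)` returns c_0, the total mass, and integrals over the bins of a grid can be batched the same way by fiddling the coefficients before the inverse FFT.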


To construct the FCs of the forest of delta functions, if I have
photon arrival phases I just take a sum (which can be slow, but this
isn't really time-critical). But it's also perfectly reasonable to
start from a histogram and take an FFT. Crucially, the histogram need
not be a reasonable-looking histogram - you can never have too many
bins, since it's not a problem if all the bin counts are either zero
or one. The only drawback here is that you introduce an error of up
to half a bin width in each photon arrival phase. But the
KDE provides a check on this too - if your kernel width is much larger
than the width of the input bins, then the errors you introduced
probably don't matter much (leaving aside nasty issues with Moiré
patterns in the likely case that your input times were already
binned).
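The histogram route is a one-liner plus a phase factor (a hypothetical helper under my conventions; the exponential shifts the coefficients to the bin centers, and M should be at most half the number of bins):

```python
import numpy as np

# Fourier coefficients of the delta forest estimated from a fine
# histogram instead of the raw phases: an FFT of the normalized bin
# counts.  Each event is implicitly moved to its bin center, i.e.
# shifted by at most half a bin width; the exp() factor puts the
# coefficients at the centers (bin j covers [j/B, (j+1)/B)).
def forest_fc_from_hist(counts, M):
    counts = np.asarray(counts, dtype=float)
    B = len(counts)
    c = np.fft.rfft(counts / counts.sum())[: M + 1]
    return c * np.exp(-1j * np.pi * np.arange(M + 1) / B)
```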

One thing to note here is that once you have the FCs, you can try
various kernels and bandwidths without going back to your original
data. (You can also get uncertainties on all the various computed
quantities, and in fact you can usually turn around and not only start
from a histogram but start from an array of values with uncertainties.
All this stuff is in a paper that's on my back burner right now.)


The thing is, I don't really know how useful all this is for KDEs on a
line or in R^n. The problem is that working with discrete Fourier
coefficients implicitly wraps the KDE around at the ends of the
interval, and it's not clear that this is still worth doing if you're
going to "pad" your region enough that this isn't a problem: the
padding forces you to evaluate at lots of points you don't care about
and use lots more Fourier coefficients than you would otherwise have
to.


Anne

> -Sam
> On Fri, Aug 27, 2010 at 2:48 PM, Anne Archibald <aarchiba@physics.mcgill.ca>
> wrote:
>>
>> My only experience with KDEs has been on the circle, where there seems
>> to be little or no literature and the constraints are rather
>> different.
>>
>> On 27 August 2010 14:38,  <josef.pktd@gmail.com> wrote:
>> > On Fri, Aug 27, 2010 at 2:17 PM, Sam Birch <sam.m.birch@gmail.com>
>> > wrote:
>> >> Hi all,
>> >> I was thinking of renovating the kernel density estimation package
>> >> (although
>> >> no promises; I'm leaving for college tomorrow morning!). I was
>> >> wondering:
>> >> a) whether anyone had started code in that direction
>> >
>> > Mike Crowe wrote code for kernel regression and Skipper started a 1D
>> > kernel density estimator in scikits.statsmodels, which cover a larger
>> > number of kernels.
>> >
>> > I don't think I have seen any higher dimensional kernel density
>> > estimation in python besides scipy.stats.kde. The Gaussian kde in
>> > scipy.stats is targeted to the underlying Fortran code for
>> > multivariate normal cdf.
>> > It's not clear to me what other n-dimensional kdes would require or
>> > whether they would fit well with the current code.
>> >
>> > One extension Robert also mentioned in the past is that it would be
>> > nice to have adaptive kernels, which I also haven't seen in Python
>> > yet.
>> >
>> >> b) what people want in it
>> >> I was thinking (as an ideal, not necessarily goal):
>> >> - Support for more than Gaussian kernels (e.g. custom,
>> >> uniform, Epanechnikov, triangular, quartic, cosine, etc.)
>> >> - More options for bandwidth selection (custom bandwidth matrices,
>> >> AMISE
>> >> optimization, cross-validation, etc.)
>> >
>> > definitely yes, I don't think they are even available for 1D yet.
>>
>> Bandwidth selection is a hotly debated topic, at least in one
>> dimension, so perhaps not just different methods but tools for
>> diagnosing bandwidth selection problems would be nice - at the least,
>> it should be made straightforward to vary the bandwidth (e.g. to plot
>> the KDE with a range of different bandwidth values).
>>
>> >> - Assorted conveniences: automatically generate the mesh, limit the
>> >> kernel's
>> >> support for speed
>> >
>> > Using scipy.spatial to limit the number of neighbors in a bounded
>> > support kernel might be a good idea.
>>
>> Simply using it to find the neighbors that need to be used should
>> speed things up. There may also be some shortcuts for
>> unbounded-support kernels (no point adding a Gaussian a hundred sigma
>> away if there's any points nearby).
>>
>> At the other end of the spectrum, for very dense KDEs, on the circle I
>> found it extremely convenient to use Fourier transforms to carry out
>> the convolution of kernel with points. In particular, I represented
>> the KDE in terms of its Fourier coefficients, so that an inverse FFT
>> immediately gave me the KDE evaluated on a grid (or, with some
>> fiddling, integrated over the bins of a histogram). I don't know
>> whether this is a useful optimization for KDEs on the line or in
>> higher dimensions, since there's the problem of wrapping.
>>
>> Anne
>>
>> > (just some thought on the topic)
>> >
>> > Josef
>> >
>> >> So, thoughts anyone? I figure it's better to over-specify and then
>> >> under-produce, so don't hold back.
>> >> Thanks,
>> >> Sam
>> >> _______________________________________________
>> >> SciPy-Dev mailing list
>> >> SciPy-Dev@scipy.org
>> >> http://mail.scipy.org/mailman/listinfo/scipy-dev

