[SciPy-Dev] Expanding Scipy's KDE functionality
Daniel Smith
smith.daniel.br@gmail....
Thu Jan 24 10:41:49 CST 2013
<josef.pktd <at> gmail.com> writes:
>
> On Thu, Jan 24, 2013 at 9:49 AM, Daniel Smith <smith.daniel.br <at> gmail.com> wrote:
> > Ok, let's see if I can respond to everyone's comments.
> >
> > >From Jake:
> >
> >> 2) The algorithm seems limited to one or maybe two dimensions.
> >> scipy.stats.gaussian_kde is designed for N dimensions, so it might be
> >> difficult to find a fit for this bandwidth selection method. One option
> >> might be to allow this bandwidth selection method via a flag in
> >> scipy.stats.gaussian_kde, and raise an error if the dimensionality is
> >> too high. To do that, your code would need to be reworked fairly
> >> extensively to fit in the gaussian_kde class.
> >
> > In principal, this method can be applied in N dimensions. However, I
> > think it would be unwise to do so. The method requires that you
> > simultaneously estimate the density and the bandwidth. Because of
> > that, you have to implement the method on some mesh, and mesh size
> > grows exponentially with the number of dimensions. The code certainly
> > could be reworked to work in N dimensions, but I don't think it would
> > be effective enough to be worth the effort. The results are also
> > primarily used for visualization, which is useless beyond 2-d.
> >
> > >From Ralf:
> >
> >> My impression is that this can't be integrated with gaussian_kde - it's not a bandwidth estimation
> method but an adaptive density estimator.
> >
> > It's both. The bandwidth estimate falls out of the density estimate.
> > That bandwidth estimate could be easily used to generate an estimate
> > on a different mesh.
> >
> >> My suggestion would be to implement the density estimator and do a good amount of
performance testing, at
> least show that the performance is as good as described in table 1 of the paper.
> >
> > I can certainly do that. I will post here when the tests are up and running.
> >
> > >From Barbier de Reuille Pierre:
> >
> >> It should be easy to separate them and use the estimation of the bandwidth without the density
estimation.
> >
> > Unfortunately, that is not the case. The bandwidth estimate is
> > generated from a fixed point calculation based on the norm of a
> > derivative of the estimated density. Unless I am missing something, it
> > would not be possible to estimate that derivative without an explicit
> > density estimate. Fourier coordinates are used because the derivative
> > estimate is simpler in those coordinates.
> >
> >> For example, as stated in the paper, the method is equivalent to a reflexion method with a
gaussian
> kernel. But the renormalisation method gives very similar results too, without enforcing
> >> that f'(0) = 0 (i.e. first derivative is always 0 on the boundaries).
> >
> > I have not currently implemented any boundary corrections, but it
> > would not be difficult to implement the renormalization method using
> > the bandwidth estimate from this method. It would require a second
> > density estimate, but the estimate would be much, much better than the
> > current code.
> >
> >> Also, can you generalise the bandwidth calculation to unbounded domains? or at least half-
domains
> (i.e. [a; oo[ or ]-oo; a])? It seems that it all depends on the domain being of finite size.
> >
> > In fact, the method currently only works on unbounded domains. The
> > exact domain you calculate the density on is an optional parameter to
> > the density estimator function. The actual domain you calculate on has
> > to be finite because a finite mesh is needed.
> >
> >> I have a different concern though: is it normal that the density returned by the method is not
normalized?
> (i.e. the integral of the density is far from being one).
> >
> > That's a bug. I can fix that with one line of code. I have always just
> > used the density estimate without units, because they aren't
> > particularly informative. However, the output should be normalized, or
> > at least a flag included to make it so.
> >
> >
> >
> > It seems like the next step is to set up a testing regime for
> > comparison to the two existing methods to compare speed and reproduce
> > the data from Table 1 in the paper. Also, it seems likely that
> > statsmodels is the more appropriate setting for this project. In
> > particular, I want to generalize the method to periodic domains, which
> > appears to be a novel implementation so more intensive testing will
> > likely be needed.
>
> Related to this part:
>
> I would like to have in statsmodels a collection of commonly used
> examples processes, and I'd like to add the Marron, Wand examples
> there.
Do you have existing code in statsmodels for this? If I'm already
writing up such a thing for testing, it's worth
my time to integrate it into statsmodels. A lot of things are already
in np.random, but I could
extend that in statsmodels with the examples from the Botev paper
and those from the Marron
and Wand paper.
>
> For kernel regression, I started with this during the nonparametric merge review
>
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/nonparametric/dgp_e
xamples.py
> https://groups.google.com/d/topic/pystatsmodels/itS_DyPHLA8/discussion
>
> (some of those examples show that also for kernel regression we need
> adaptive bandwidth in cases with uneven "smoothness")
>
> I had looked at the Marron Wand examples before, but IIRC, it was for
> either orthogonal series or spline estimation.
>
> Statsmodels is using cython to do the binning for the fft based kde,
> but I never checked how much the speed gain is compared to
> np.histogram for example. (Skipper's work)
>
> Josef
>
> >
> > Thanks,
> > Daniel
> > _______________________________________________
> > SciPy-Dev mailing list
> > SciPy-Dev <at> scipy.org
> > http://mail.scipy.org/mailman/listinfo/scipy-dev
>
More information about the SciPy-Dev
mailing list