[SciPy-Dev] Expanding Scipy's KDE functionality
Thu Jan 24 08:49:39 CST 2013
Ok, let's see if I can respond to everyone's comments.
> 2) The algorithm seems limited to one or maybe two dimensions.
> scipy.stats.gaussian_kde is designed for N dimensions, so it might be
> difficult to find a fit for this bandwidth selection method. One option
> might be to allow this bandwidth selection method via a flag in
> scipy.stats.gaussian_kde, and raise an error if the dimensionality is
> too high. To do that, your code would need to be reworked fairly
> extensively to fit in the gaussian_kde class.
In principal, this method can be applied in N dimensions. However, I
think it would be unwise to do so. The method requires that you
simultaneously estimate the density and the bandwidth. Because of
that, you have to implement the method on some mesh, and mesh size
grows exponentially with the number of dimensions. The code certainly
could be reworked to work in N dimensions, but I don't think it would
be effective enough to be worth the effort. The results are also
primarily used for visualization, which is useless beyond 2-d.
> My impression is that this can't be integrated with gaussian_kde - it's not a bandwidth estimation method but an adaptive density estimator.
It's both. The bandwidth estimate falls out of the density estimate.
That bandwidth estimate could be easily used to generate an estimate
on a different mesh.
> My suggestion would be to implement the density estimator and do a good amount of performance testing, at least show that the performance is as good as described in table 1 of the paper.
I can certainly do that. I will post here when the tests are up and running.
>From Barbier de Reuille Pierre:
> It should be easy to separate them and use the estimation of the bandwidth without the density estimation.
Unfortunately, that is not the case. The bandwidth estimate is
generated from a fixed point calculation based on the norm of a
derivative of the estimated density. Unless I am missing something, it
would not be possible to estimate that derivative without an explicit
density estimate. Fourier coordinates are used because the derivative
estimate is simpler in those coordinates.
> For example, as stated in the paper, the method is equivalent to a reflexion method with a gaussian kernel. But the renormalisation method gives very similar results too, without enforcing
> that f'(0) = 0 (i.e. first derivative is always 0 on the boundaries).
I have not currently implemented any boundary corrections, but it
would not be difficult to implement the renormalization method using
the bandwidth estimate from this method. It would require a second
density estimate, but the estimate would be much, much better than the
> Also, can you generalise the bandwidth calculation to unbounded domains? or at least half-domains (i.e. [a; oo[ or ]-oo; a])? It seems that it all depends on the domain being of finite size.
In fact, the method currently only works on unbounded domains. The
exact domain you calculate the density on is an optional parameter to
the density estimator function. The actual domain you calculate on has
to be finite because a finite mesh is needed.
> I have a different concern though: is it normal that the density returned by the method is not normalized? (i.e. the integral of the density is far from being one).
That's a bug. I can fix that with one line of code. I have always just
used the density estimate without units, because they aren't
particularly informative. However, the output should be normalized, or
at least a flag included to make it so.
It seems like the next step is to set up a testing regime for
comparison to the two existing methods to compare speed and reproduce
the data from Table 1 in the paper. Also, it seems likely that
statsmodels is the more appropriate setting for this project. In
particular, I want to generalize the method to periodic domains, which
appears to be a novel implementation so more intensive testing will
likely be needed.
More information about the SciPy-Dev