[SciPy-Dev] Expanding Scipy's KDE functionality
Thu Jan 24 10:49:21 CST 2013
On Thu, Jan 24, 2013 at 11:41 AM, Daniel Smith
> <josef.pktd <at> gmail.com> writes:
>> On Thu, Jan 24, 2013 at 9:49 AM, Daniel Smith <smith.daniel.br <at> gmail.com> wrote:
>> > Ok, let's see if I can respond to everyone's comments.
>> > >From Jake:
>> >> 2) The algorithm seems limited to one or maybe two dimensions.
>> >> scipy.stats.gaussian_kde is designed for N dimensions, so it might be
>> >> difficult to find a fit for this bandwidth selection method. One option
>> >> might be to allow this bandwidth selection method via a flag in
>> >> scipy.stats.gaussian_kde, and raise an error if the dimensionality is
>> >> too high. To do that, your code would need to be reworked fairly
>> >> extensively to fit in the gaussian_kde class.
>> > In principal, this method can be applied in N dimensions. However, I
>> > think it would be unwise to do so. The method requires that you
>> > simultaneously estimate the density and the bandwidth. Because of
>> > that, you have to implement the method on some mesh, and mesh size
>> > grows exponentially with the number of dimensions. The code certainly
>> > could be reworked to work in N dimensions, but I don't think it would
>> > be effective enough to be worth the effort. The results are also
>> > primarily used for visualization, which is useless beyond 2-d.
>> > >From Ralf:
>> >> My impression is that this can't be integrated with gaussian_kde - it's not a bandwidth estimation
>> method but an adaptive density estimator.
>> > It's both. The bandwidth estimate falls out of the density estimate.
>> > That bandwidth estimate could be easily used to generate an estimate
>> > on a different mesh.
>> >> My suggestion would be to implement the density estimator and do a good amount of
> performance testing, at
>> least show that the performance is as good as described in table 1 of the paper.
>> > I can certainly do that. I will post here when the tests are up and running.
>> > >From Barbier de Reuille Pierre:
>> >> It should be easy to separate them and use the estimation of the bandwidth without the density
>> > Unfortunately, that is not the case. The bandwidth estimate is
>> > generated from a fixed point calculation based on the norm of a
>> > derivative of the estimated density. Unless I am missing something, it
>> > would not be possible to estimate that derivative without an explicit
>> > density estimate. Fourier coordinates are used because the derivative
>> > estimate is simpler in those coordinates.
>> >> For example, as stated in the paper, the method is equivalent to a reflexion method with a
>> kernel. But the renormalisation method gives very similar results too, without enforcing
>> >> that f'(0) = 0 (i.e. first derivative is always 0 on the boundaries).
>> > I have not currently implemented any boundary corrections, but it
>> > would not be difficult to implement the renormalization method using
>> > the bandwidth estimate from this method. It would require a second
>> > density estimate, but the estimate would be much, much better than the
>> > current code.
>> >> Also, can you generalise the bandwidth calculation to unbounded domains? or at least half-
>> (i.e. [a; oo[ or ]-oo; a])? It seems that it all depends on the domain being of finite size.
>> > In fact, the method currently only works on unbounded domains. The
>> > exact domain you calculate the density on is an optional parameter to
>> > the density estimator function. The actual domain you calculate on has
>> > to be finite because a finite mesh is needed.
>> >> I have a different concern though: is it normal that the density returned by the method is not
>> (i.e. the integral of the density is far from being one).
>> > That's a bug. I can fix that with one line of code. I have always just
>> > used the density estimate without units, because they aren't
>> > particularly informative. However, the output should be normalized, or
>> > at least a flag included to make it so.
>> > It seems like the next step is to set up a testing regime for
>> > comparison to the two existing methods to compare speed and reproduce
>> > the data from Table 1 in the paper. Also, it seems likely that
>> > statsmodels is the more appropriate setting for this project. In
>> > particular, I want to generalize the method to periodic domains, which
>> > appears to be a novel implementation so more intensive testing will
>> > likely be needed.
>> Related to this part:
>> I would like to have in statsmodels a collection of commonly used
>> examples processes, and I'd like to add the Marron, Wand examples
> Do you have existing code in statsmodels for this? If I'm already
> writing up such a thing for testing, it's worth
> my time to integrate it into statsmodels. A lot of things are already
> in np.random, but I could
> extend that in statsmodels with the examples from the Botev paper
> and those from the Marron
> and Wand paper.
Nothing yet, I have used a few examples before to try out things, but
they are spread over some uncommitted scripts. The idea to add them
more systematically for reuse is only recent.
>> For kernel regression, I started with this during the nonparametric merge review
>> (some of those examples show that also for kernel regression we need
>> adaptive bandwidth in cases with uneven "smoothness")
>> I had looked at the Marron Wand examples before, but IIRC, it was for
>> either orthogonal series or spline estimation.
>> Statsmodels is using cython to do the binning for the fft based kde,
>> but I never checked how much the speed gain is compared to
>> np.histogram for example. (Skipper's work)
>> > Thanks,
>> > Daniel
>> > _______________________________________________
>> > SciPy-Dev mailing list
>> > SciPy-Dev <at> scipy.org
>> > http://mail.scipy.org/mailman/listinfo/scipy-dev
> SciPy-Dev mailing list
More information about the SciPy-Dev