[SciPy-Dev] Expanding Scipy's KDE functionality
Wed Jan 23 16:41:27 CST 2013
On Wed, Jan 23, 2013 at 9:30 PM, Jake Vanderplas <
> Hi Daniel,
> That looks like a nice implementation. My concern about adding it to
> scipy is twofold:
> 1) Is this a well-known and well-proven technique, or is it more
> cutting-edge? My view is that scipy should not seek to implement every
> cutting-edge algorithm: in the long-run this will lead to code bloat and
> difficulty of maintenance. If that's the case, your code might be a
> better fit for statsmodels or another more specialized package.
It seems to be a new technique, however the paper has already 150
citations. The algorithm also seems to be straightforward to implement, so
I think it could be put into scipy.stats. statsmodels.nonparametric would
also be a good place for it though.
> 2) The algorithm seems limited to one or maybe two dimensions.
> scipy.stats.gaussian_kde is designed for N dimensions, so it might be
> difficult to find a fit for this bandwidth selection method. One option
> might be to allow this bandwidth selection method via a flag in
> scipy.stats.gaussian_kde, and raise an error if the dimensionality is
> too high. To do that, your code would need to be reworked fairly
> extensively to fit in the gaussian_kde class.
> I'd like other devs to weigh-in about the algorithm, especially my
> concern #1, before any work starts on a scipy PR.
I quickly browsed the paper and original (BSD-licensed) code. My impression
is that this can't be integrated with gaussian_kde - it's not a bandwidth
estimation method but an adaptive density estimator. The method is only
1-D, but will handle especially multimodal distributions much better than
My suggestion would be to implement the density estimator and do a good
amount of performance testing, at least show that the performance is as
good as described in table 1 of the paper. Then we can still decide where
to put it.
> On 01/23/2013 12:11 PM, Daniel Smith wrote:
> > Hello,
> > This was started on a different thread, but I thought I would post a
> > new thread focused on this. Currently, I have some existing code that
> > implements the bandwidth selection algorithm from:
> > Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density
> > estimation via diffusion. The Annals of Statistics, 38(5):2916-2957,
> > 2010.
> > Zdravko Botev implemented the code in MatLab which can be found here:
> > My code for that is here:
> > https://github.com/Daniel-B-Smith/KDE-for-SciPy
> > I assume I probably need to find a workaround to avoid the float128 in
> > the function fixed_point before I can add it to SciPy. I wrote the
> > code a couple of years ago, so it will take me a moment to map out the
> > best workaround (there is a very large number being multiplied by a
> > very small number). I can also add the 2d-version once I start
> > integrating with SciPy. I have a couple of questions remaining. First,
> > should I implement this in SciPy? StatsModels? Both? Secondly, can I
> > use Cython to generate C code for the function fixed_point? Or do I
> > need to write it up in the Numpy C API?
> > If there is somewhere else I should post this and/or someone I should
> > directly contact, I would greatly appreciate it.
> > Thanks,
> > Daniel
> > _______________________________________________
> > SciPy-Dev mailing list
> > SciPy-Dev@scipy.org
> > http://mail.scipy.org/mailman/listinfo/scipy-dev
> SciPy-Dev mailing list
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the SciPy-Dev