[SciPy-Dev] Expanding Scipy's KDE functionality

Pierre Barbier de Reuille pierre@barbierdereuille....
Thu Jan 24 06:52:27 CST 2013


Hi,

I am not a developer of SciPy per se, but I am currently looking into these
KDE methods with bounded domains.

First, it looks like there are two parts with the algorithm:
 1 - the estimation of the bandwidth
 2 - the estimation of the density

It should be easy to separate them and use the estimation of the bandwidth
without the density estimation. This would be interesting as it seems that
the density needs to be estimated on a regular mesh. It also allows
comparison of the method with other estimators. For example, as stated in
the paper, the method is equivalent to a reflexion method with a gaussian
kernel. But the renormalisation method gives very similar results too,
without enforcing that f'(0) = 0 (i.e. first derivative is always 0 on the
boundaries).

I have a different concern though: is it normal that the density returned
by the method is not normalized? (i.e. the integral of the density is far
from being one).

Also, can you generalise the bandwidth calculation to unbounded domains? or
at least half-domains (i.e. [a; oo[ or ]-oo; a])? It seems that it all
depends on the domain being of finite size.

-- 
Barbier de Reuille Pierre


On 23 January 2013 23:41, Ralf Gommers <ralf.gommers@gmail.com> wrote:

>
>
>
> On Wed, Jan 23, 2013 at 9:30 PM, Jake Vanderplas <
> vanderplas@astro.washington.edu> wrote:
>
>> Hi Daniel,
>> That looks like a nice implementation.  My concern about adding it to
>> scipy is twofold:
>>
>> 1) Is this a well-known and well-proven technique, or is it more
>> cutting-edge?  My view is that scipy should not seek to implement every
>> cutting-edge algorithm: in the long-run this will lead to code bloat and
>> difficulty of maintenance.  If that's the case, your code might be a
>> better fit for statsmodels or another more specialized package.
>>
>
> It seems to be a new technique, however the paper has already 150
> citations. The algorithm also seems to be straightforward to implement, so
> I think it could be put into scipy.stats. statsmodels.nonparametric would
> also be a good place for it though.
>
>
>>
>> 2) The algorithm seems limited to one or maybe two dimensions.
>> scipy.stats.gaussian_kde is designed for N dimensions, so it might be
>> difficult to find a fit for this bandwidth selection method. One option
>> might be to allow this bandwidth selection method via a flag in
>> scipy.stats.gaussian_kde, and raise an error if the dimensionality is
>> too high.  To do that, your code would need to be reworked fairly
>> extensively to fit in the gaussian_kde class.
>>
>> I'd like other devs to weigh-in about the algorithm, especially my
>> concern #1, before any work starts on a scipy PR.
>>
>
> I quickly browsed the paper and original (BSD-licensed) code. My
> impression is that this can't be integrated with gaussian_kde - it's not a
> bandwidth estimation method but an adaptive density estimator. The method
> is only 1-D, but will handle especially multimodal distributions much
> better than gaussian_kde.
>
> My suggestion would be to implement the density estimator and do a good
> amount of performance testing, at least show that the performance is as
> good as described in table 1 of the paper. Then we can still decide where
> to put it.
>
> Ralf
>
>
>
>> On 01/23/2013 12:11 PM, Daniel Smith wrote:
>> > Hello,
>> >
>> > This was started on a different thread, but I thought I would post a
>> > new thread focused on this. Currently, I have some existing code that
>> > implements the bandwidth selection algorithm from:
>> >
>> > Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density
>> > estimation via diffusion. The Annals of Statistics, 38(5):2916-2957,
>> > 2010.
>> >
>> > Zdravko Botev implemented the code in MatLab which can be found here:
>> >
>> >
>> http://www.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-estimator
>> >
>> > My code for that is here:
>> >
>> > https://github.com/Daniel-B-Smith/KDE-for-SciPy
>> >
>> > I assume I probably need to find a workaround to avoid the float128 in
>> > the function fixed_point before I can add it to SciPy. I wrote the
>> > code a couple of years ago, so it will take me a moment to map out the
>> > best workaround (there is a very large number being multiplied by a
>> > very small number). I can also add the 2d-version once I start
>> > integrating with SciPy. I have a couple of questions remaining. First,
>> > should I implement this in SciPy? StatsModels? Both? Secondly, can I
>> > use Cython to generate C code for the function fixed_point? Or do I
>> > need to write it up in the Numpy C API?
>> >
>> > If there is somewhere else I should post this and/or someone I should
>> > directly contact, I would greatly appreciate it.
>> >
>> > Thanks,
>> > Daniel
>> > _______________________________________________
>> > SciPy-Dev mailing list
>> > SciPy-Dev@scipy.org
>> > http://mail.scipy.org/mailman/listinfo/scipy-dev
>>
>>
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-dev/attachments/20130124/deed23df/attachment.html 


More information about the SciPy-Dev mailing list