[SciPy-User] scipy.stats convolution of two distributions

josef.pktd@gmai... josef.pktd@gmai...
Mon Apr 9 17:06:14 CDT 2012


On Mon, Apr 9, 2012 at 5:04 PM, nicky van foreest <vanforeest@gmail.com> wrote:
> Hi,
>
> In one of my projects I built some code that depends in a nice and
> generic way on the methods of rv_continuous in scipy.stats. Now it
> turns out that I shot myself in the foot because I need to run a test
> with the sum (convolution) of three such distributions. As far as I
> can see there is no standard way to achieve this in scipy.stats. Does
> anybody know of a good (generic) way to do this? If not, would it
> actually be useful to add such functionality to scipy.stats, if
> possible at all?
>
> After some struggling I wrote the code below, but this is a first
> attempt. I am very interested in obtaining feedback to turn this into
> something that is useful for a larger population than just me.

Sounds fun.

The plots are a bit misleading, because matplotlib is doing the
interpolation for you

pl.plot(grid,D1.cdf(grid), '.')
pl.plot(grid,conv.cdf(grid), '.')
pl.plot(grid,conv2.cdf(grid), '.')

The main problem is the mixing of continuous distribution and discrete
grid. pdf, cdf, .... want to evaluate at a point and with floating
points it's perilous (as we discussed before).

As I read it, neither your cdf nor pdf return specific points.

I think the easiest would be to work with linear interpolation
interp1d on a fine grid, but I never checked how fast this is for a
fine grid.
If cdf and pdf are defined with a piecewise polynomial, then they can
be evaluated at any points and the generic class, rv_continuous,
should be able to handle everything.
The alternative would be to work with the finite grid and use the
fixed spacing to define a lattice distribution, but that doesn't exist
in scipy.

I haven't thought about the convolution itself yet, (an alternative
would be using fft to work with the characteristic function.)

If you use this for a test, how much do you care about having accurate
tails, or is effectively truncating the distribution ok?

Josef

>
> Thanks in advance
>
> Nicky
>
> import numpy as np
> import scipy.stats
> from scipy.stats import poisson, uniform, expon
> import pylab as pl
>
> # I need a grid since np.convolve requires two arrays.
>
> # choose the grid such that it covers the numerically relevant support
> # of the distributions
> grid = np.arange(0., 5., 0.001)
>
> # I need P(D1+D2+D3 <= x)
> D1 = expon(scale = 1./2)
> D2 = expon(scale = 1./3)
> D3 = expon(scale = 1./6)
>
> class convolved_gen(scipy.stats.rv_continuous):
>    def __init__(self, D1, D2, grid):
>        self.D1 = D1
>        self.D2 = D2
>        delta = grid[1]-grid[0]
>        p1 = self.D1.pdf(grid)
>        p2 = self.D2.pdf(grid)*delta
>        self.conv = np.convolve(p1, p2)
>
>        super(convolved_gen, self).__init__(name = "convolved")
>
>    def _cdf(self, grid):
>        cdf = np.cumsum(self.conv)
>        return cdf/cdf[-1] # ensure that cdf[-1] = 1
>
>    def _pdf(self, grid):
>        return self.conv[:len(grid)]
>
>    def _stats(self):
>        m = self.D1.stats("m") + self.D2.stats("m")
>        v = self.D1.stats("v") + self.D2.stats("v")
>        return m, v, 0., 0.
>
>
>
> convolved = convolved_gen(D1, D2, grid)
> conv = convolved()
>
> convolved2 = convolved_gen(conv, D3, grid)
> conv2 = convolved2()
>
> pl.plot(grid,D1.cdf(grid))
> pl.plot(grid,conv.cdf(grid))
> pl.plot(grid,conv2.cdf(grid))
> pl.show()
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user


More information about the SciPy-User mailing list