[SciPy-User] scipy.stats convolution of two distributions

nicky van foreest vanforeest@gmail....
Tue Apr 10 05:07:23 CDT 2012


Hi Josef,

Thanks for all your answers. This was more than I was hoping for. I
think I have to do some serious studying before I can deal with all
your feedback. As a start I found the following article, which
(judging from the abstract) seems a good starting point. Based on your
mails above I'll also try to compile a list of requirements that a
convolution operator on probability distributions should offer.

Mani K. Agrawal (Manugistics, Inc.) and Salah E. Elmaghraby (North
Carolina State University), "On computing the distribution function of
the sum of independent random variables".
http://dx.doi.org/10.1016/S0305-0548(99)00133-1


On 10 April 2012 03:09,  <josef.pktd@gmail.com> wrote:
> On Mon, Apr 9, 2012 at 8:45 PM,  <josef.pktd@gmail.com> wrote:
>> On Mon, Apr 9, 2012 at 6:36 PM,  <josef.pktd@gmail.com> wrote:
>>> On Mon, Apr 9, 2012 at 6:06 PM,  <josef.pktd@gmail.com> wrote:
>>>> On Mon, Apr 9, 2012 at 5:04 PM, nicky van foreest <vanforeest@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> In one of my projects I built some code that depends in a nice and
>>>>> generic way on the methods of rv_continuous in scipy.stats. Now it
>>>>> turns out that I shot myself in the foot because I need to run a test
>>>>> with the sum (convolution) of three such distributions. As far as I
>>>>> can see there is no standard way to achieve this in scipy.stats. Does
>>>>> anybody know of a good (generic) way to do this? If not, would it
>>>>> actually be useful to add such functionality to scipy.stats, if
>>>>> possible at all?
>>>>>
>>>>> After some struggling I wrote the code below, but this is a first
>>>>> attempt. I am very interested in obtaining feedback to turn this into
>>>>> something that is useful for a larger population than just me.
>>>>
>>>> Sounds fun.
>>>>
>>>> The plots are a bit misleading, because matplotlib is doing the
>>>> interpolation for you
>>>>
>>>> pl.plot(grid,D1.cdf(grid), '.')
>>>> pl.plot(grid,conv.cdf(grid), '.')
>>>> pl.plot(grid,conv2.cdf(grid), '.')
>>>>
>>>> The main problem is the mixing of a continuous distribution and a
>>>> discrete grid. pdf, cdf, ... want to be evaluated at a point, and
>>>> with floating point that's perilous (as we discussed before).
>>>>
>>>> As I read it, neither your cdf nor your pdf returns values at specific points.
>>>>
>>>> I think the easiest would be to work with linear interpolation
>>>> (interp1d) on a fine grid, but I never checked how fast this is for
>>>> a fine grid.
>>>> If cdf and pdf are defined with a piecewise polynomial, then they can
>>>> be evaluated at arbitrary points and the generic class, rv_continuous,
>>>> should be able to handle everything.
>>>> The alternative would be to work with the finite grid and use the
>>>> fixed spacing to define a lattice distribution, but that doesn't exist
>>>> in scipy.
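
A minimal sketch of that interp1d route, assuming the same grid and the
two exponentials from the script further down (the name convolved_cdf is
only illustrative, not anything in scipy.stats):

import numpy as np
from scipy import stats
from scipy.interpolate import interp1d

delta = 0.001
grid = np.arange(0., 5., delta)
D1 = stats.expon(scale=1./2)
D2 = stats.expon(scale=1./3)

# rectangle-rule convolution of the two densities on the grid
conv = np.convolve(D1.pdf(grid), D2.pdf(grid) * delta)[:len(grid)]

# cumulative sum approximates the cdf on the grid; interp1d makes it
# callable at arbitrary points inside [0, 5), not only at grid indices
cdf_grid = np.cumsum(conv) * delta
cdf_grid /= cdf_grid[-1]          # force the cdf to reach 1 at the right edge
convolved_cdf = interp1d(grid, cdf_grid)

print(convolved_cdf(0.12345))     # any point in the grid range, not just a grid value

A piecewise polynomial (e.g. a monotone spline) could replace the linear
interpolant if smoother pdf values are needed.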
>>>>
>>>> I haven't thought about the convolution itself yet (an alternative
>>>> would be to use the FFT and work with the characteristic function).
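
A small sketch of that FFT alternative, with the same grid and exponentials
as above; multiplying the discrete Fourier transforms of the sampled
densities is the discrete analogue of multiplying characteristic functions,
and zero-padding keeps the circular convolution from wrapping around:

import numpy as np
from scipy import stats

delta = 0.001
grid = np.arange(0., 5., delta)
n = len(grid)

p1 = stats.expon(scale=1./2).pdf(grid) * delta
p2 = stats.expon(scale=1./3).pdf(grid) * delta

# zero-pad to 2n so the circular FFT convolution equals the linear one
m = 2 * n
conv_fft = np.fft.irfft(np.fft.rfft(p1, m) * np.fft.rfft(p2, m), m)[:n] / delta

# direct convolution for comparison; the two should agree to round-off
conv_direct = np.convolve(p1, p2)[:n] / delta
print(np.max(np.abs(conv_fft - conv_direct)))

For a 5000-point grid this costs O(n log n) instead of the O(n^2) of
np.convolve, which starts to matter once the grid gets fine or several
distributions are chained.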
>>>>
>>>> If you use this for a test, how much do you care about having accurate
>>>> tails, or is effectively truncating the distribution ok?
>>>
>>> The other question I thought about in a similar situation is: what is
>>> the usage or access pattern?
>>>
>>> In many cases I'm not really interested in evaluating the pdf at
>>> specific points, but over a range or interval of points. In that case
>>> relying on floating point access doesn't look like the best way to go,
>>> and I would spend more time on the `expect` method: calculating the
>>> expectation of a function w.r.t. the distribution in a way that can use
>>> the internal representation instead of the generic integrate.quad.
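
To illustrate that access pattern, a sketch of an expectation computed
directly from the internal grid representation (same grid and exponentials
as in the script below; expect_on_grid is a made-up helper, not a scipy API):

import numpy as np
from scipy import stats

delta = 0.001
grid = np.arange(0., 5., delta)
D1 = stats.expon(scale=1./2)
D2 = stats.expon(scale=1./3)
pdf_grid = np.convolve(D1.pdf(grid), D2.pdf(grid) * delta)[:len(grid)]

def expect_on_grid(func):
    # E[func(X1 + X2)] as a rectangle-rule sum over the grid,
    # with no call to integrate.quad
    return np.sum(func(grid) * pdf_grid) * delta

print(expect_on_grid(lambda x: x))       # ~ 1/2 + 1/3 = 0.833...
print(expect_on_grid(lambda x: x ** 2))  # ~ 1/4 + 1/9 + (5/6)**2 = 1.055...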
>>
>>
>> a test case to check numerical accuracy of discretized approximation/convolution
>>
>> Di = expon(scale = 1./2)
>> sum of identical exponentially distributed random variables is gamma
>> http://en.wikipedia.org/wiki/Gamma_distribution
>>
>> 1 exponential and corresponding gamma
>>
>>>>> convolved.D1.pdf(grid[:10])
>> array([ 2.        ,  1.996004  ,  1.99201598,  1.98803593,  1.98406383,
>>        1.98009967,  1.97614343,  1.97219509,  1.96825464,  1.96432206])
>>>>> stats.gamma.pdf(grid[:10], 1, scale=1/2.)
>> array([ 2.        ,  1.996004  ,  1.99201598,  1.98803593,  1.98406383,
>>        1.98009967,  1.97614343,  1.97219509,  1.96825464,  1.96432206])
>>
>> sum of 2 exponentials
>>
>>>>> stats.gamma.pdf(grid[:10], 2, scale=1/2.)
>> array([ 0.        ,  0.00399201,  0.00796806,  0.01192822,  0.01587251,
>>        0.019801  ,  0.02371372,  0.02761073,  0.03149207,  0.0353578 ])
>>>>> convolved.pdf(np.zeros(10))
>> array([ 0.004     ,  0.00798402,  0.0119521 ,  0.01590429,  0.01984064,
>>        0.0237612 ,  0.02766601,  0.03155512,  0.03542858,  0.03928644])
>>>>> stats.gamma.pdf(grid[1:21], 2, scale=1/2.) - convolved.pdf(np.zeros(20))
>> array([ -7.99200533e-06,  -1.59520746e-05,  -2.38803035e-05,
>>        -3.17767875e-05,  -3.96416218e-05,  -4.74749013e-05,
>>        -5.52767208e-05,  -6.30471746e-05,  -7.07863571e-05,
>>        -7.84943621e-05,  -8.61712832e-05,  -9.38172141e-05,
>>        -1.01432248e-04,  -1.09016477e-04,  -1.16569995e-04,
>>        -1.24092894e-04,  -1.31585266e-04,  -1.39047203e-04,
>>        -1.46478797e-04,  -1.53880139e-04])
>>
>> sum of 3 exponentials
>>
>>>>> convolved2.conv[:2000:100]
>> array([  8.00000000e-06,   3.37382569e-02,   1.08865338e-01,
>>         1.99552301e-01,   2.89730911e-01,   3.70089661e-01,
>>         4.35890673e-01,   4.85403437e-01,   5.18794908e-01,
>>         5.37354948e-01,   5.42966239e-01,   5.37750775e-01,
>>         5.23842475e-01,   5.03248651e-01,   4.77772987e-01,
>>         4.48980181e-01,   4.18187929e-01,   3.86476082e-01,
>>         3.54705854e-01,   3.23544178e-01])
>>>>> stats.gamma.pdf(grid[1:2001:100], 3, scale=1/2.)
>> array([  3.99200799e-06,   3.33407414e-02,   1.08109964e-01,
>>         1.98494147e-01,   2.88432744e-01,   3.68614464e-01,
>>         4.34297139e-01,   4.83743524e-01,   5.17112771e-01,
>>         5.35686765e-01,   5.41340592e-01,   5.36189346e-01,
>>         5.22360899e-01,   5.01857412e-01,   4.76478297e-01,
>>         4.47784794e-01,   4.17091870e-01,   3.85477285e-01,
>>         3.53800704e-01,   3.22727969e-01])
>>>>> stats.gamma.pdf(grid[1:2001:100], 3, scale=1/2.) - convolved2.conv[:2000:100]
>> array([ -4.00799201e-06,  -3.97515433e-04,  -7.55373610e-04,
>>        -1.05815476e-03,  -1.29816640e-03,  -1.47519705e-03,
>>        -1.59353434e-03,  -1.65991307e-03,  -1.68213690e-03,
>>        -1.66818281e-03,  -1.62564706e-03,  -1.56142889e-03,
>>        -1.48157626e-03,  -1.39123891e-03,  -1.29468978e-03,
>>        -1.19538731e-03,  -1.09605963e-03,  -9.98797767e-04,
>>        -9.05149579e-04,  -8.16209174e-04])
>>
>> values shifted by one?
>>
>>>>> np.argmax(stats.gamma.pdf(grid, 3, scale=1/2.))
>> 1000
>>
>>>>> np.argmax(convolved2.conv)
>> 999
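
The shift can be reproduced standalone with the same setup as this test
case (all three Di = expon(scale=1./2), so the exact sum is gamma with
shape 3 and scale 1/2): the discrete convolution peaks one grid step
early, and comparing against grid[1:] lines the peaks up again.

import numpy as np
from scipy import stats

delta = 0.001
grid = np.arange(0., 5., delta)
f = stats.expon(scale=1./2).pdf(grid)

conv2 = np.convolve(f, f * delta)[:len(grid)]      # 2 exponentials
conv3 = np.convolve(conv2, f * delta)[:len(grid)]  # 3 exponentials

print(np.argmax(conv3))                                    # 999
print(np.argmax(stats.gamma.pdf(grid, 3, scale=1./2)))     # 1000 (mode at x = 1.0)
print(np.argmax(stats.gamma.pdf(grid[1:], 3, scale=1./2))) # 999, i.e. one step earlier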
>
>
> On the other hand, the convolution with 3 distributions looks accurate
> to within about 1e-4, so depending on the application this works pretty well.
>
>>>> stats.gamma.cdf(grid[1:2001:100], 3, scale=1/2.) - convolved2.cdf(np.zeros(2000))[:2000:100]
> array([ -6.64904892e-09,  -3.41833566e-05,  -1.12776463e-04,
>        -2.11506843e-04,  -3.14716605e-04,  -4.12809581e-04,
>        -5.00378449e-04,  -5.74846082e-04,  -6.35489967e-04,
>        -6.82750381e-04,  -7.17747313e-04,  -7.41949767e-04,
>        -7.56955276e-04,  -7.64348246e-04,  -7.65613908e-04,
>        -7.62090849e-04,  -7.54949690e-04,  -7.45188971e-04,
>        -7.33641860e-04,  -7.20989200e-04])
>
> for 2 distributions
>
>>>> np.max(np.abs(stats.gamma.cdf(grid[1:2001], 2, scale=1/2.) - convolved.cdf(np.zeros(2000))[:2000]))
> 0.00039327739646649595
>
> Josef
>
>>
>> Josef
>>
>>>
>>> Josef
>>>
>>>
>>>>
>>>> Josef
>>>>
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Nicky
>>>>>
>>>>> import numpy as np
>>>>> import scipy.stats
>>>>> from scipy.stats import expon
>>>>> import pylab as pl
>>>>>
>>>>> # I need a grid since np.convolve requires two arrays.
>>>>>
>>>>> # choose the grid such that it covers the numerically relevant support
>>>>> # of the distributions
>>>>> grid = np.arange(0., 5., 0.001)
>>>>>
>>>>> # I need P(D1+D2+D3 <= x)
>>>>> D1 = expon(scale = 1./2)
>>>>> D2 = expon(scale = 1./3)
>>>>> D3 = expon(scale = 1./6)
>>>>>
>>>>> class convolved_gen(scipy.stats.rv_continuous):
>>>>>    def __init__(self, D1, D2, grid):
>>>>>        self.D1 = D1
>>>>>        self.D2 = D2
>>>>>        delta = grid[1]-grid[0]
>>>>>        p1 = self.D1.pdf(grid)
>>>>>        p2 = self.D2.pdf(grid)*delta
>>>>>        self.conv = np.convolve(p1, p2)
>>>>>
>>>>>        super(convolved_gen, self).__init__(name = "convolved")
>>>>>
>>>>>    def _cdf(self, grid):
>>>>>        cdf = np.cumsum(self.conv)
>>>>>        return cdf/cdf[-1] # ensure that cdf[-1] = 1
>>>>>
>>>>>    def _pdf(self, grid):
>>>>>        return self.conv[:len(grid)]
>>>>>
>>>>>    def _stats(self):
>>>>>        m = self.D1.stats("m") + self.D2.stats("m")
>>>>>        v = self.D1.stats("v") + self.D2.stats("v")
>>>>>        return m, v, 0., 0.
>>>>>
>>>>>
>>>>>
>>>>> convolved = convolved_gen(D1, D2, grid)
>>>>> conv = convolved()
>>>>>
>>>>> convolved2 = convolved_gen(conv, D3, grid)
>>>>> conv2 = convolved2()
>>>>>
>>>>> pl.plot(grid,D1.cdf(grid))
>>>>> pl.plot(grid,conv.cdf(grid))
>>>>> pl.plot(grid,conv2.cdf(grid))
>>>>> pl.show()
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user

