[SciPy-User] scipy.stats convolution of two distributions
josef.pktd@gmai...
josef.pktd@gmai...
Tue Apr 10 09:56:10 CDT 2012
On Tue, Apr 10, 2012 at 6:07 AM, nicky van foreest <vanforeest@gmail.com> wrote:
> Hi Josef,
>
> Thanks for all your answers. This was more than was hoping for. I
> think I have to do some serious studying before I can deal with all
> your feedback. As a start I found the following article, which seems
> (telling from the abstract) a good starting point. From your list of
> mails above I'll also try to make a list of requirements that a
> convolution operator on probability distributions should offer.
I think how "fancy" you want to get will depend a lot on what you want
to use it for. For example, I was reading the literature using
characteristic function for applications where the tail probabilities
should be accurate. Tails are difficult to get to a good precision,
and there are many papers with various tricks, none of which I tried
to implement.
In the example, if I increase the truncation to 10, grid =
np.arange(0., 10., 0.001), 10000 points which is still fast with
convolution, then the percent or absolute difference to the true
distribution in the gamma case doesn't look so bad.
mean and variance
>>> convolved2.stats()
(array(1.5), array(0.75))
>>> (grid[1:10000] * convolved2.conv[1:10000]).sum()*0.001
1.502997188928352
>>> ((grid[1:10000] - 1.5)**2 * convolved2.conv[1:10000]).sum()*0.001
0.7522175059430164
>>> ((grid[1:10000] - 1.502997188928352)**2 * convolved2.conv[1:10000]).sum()*0.001
0.7522355563037656
(I didn't try to figure out the convolution boundaries.)
Josef
>
> On computing the distribution function of the sum of independentrandomvariables
> Mani K. Agrawal*, a, [Author Vitae], Salah E. Elmaghraby*, b, , [Author Vitae]
> a Manugistics, Inc., 9200 E. Panorama Circle, Englewood, CO 80112, USA
> b Department of Industrial Engineering, College of Engineering, North
> Carolina State University, Campus Box 7906, Raleigh, NC 27695-7906,
> USA
> Received 1 August 1998. Revised 1 July 1999. Available online 4 December 2000.
> http://dx.doi.org/10.1016/S0305-0548(99)00133-1, How to Cite or Link Using DOI
> Cited by in Scopus (7)
> Permissions & Reprints
>
>
> On 10 April 2012 03:09, <josef.pktd@gmail.com> wrote:
>> On Mon, Apr 9, 2012 at 8:45 PM, <josef.pktd@gmail.com> wrote:
>>> On Mon, Apr 9, 2012 at 6:36 PM, <josef.pktd@gmail.com> wrote:
>>>> On Mon, Apr 9, 2012 at 6:06 PM, <josef.pktd@gmail.com> wrote:
>>>>> On Mon, Apr 9, 2012 at 5:04 PM, nicky van foreest <vanforeest@gmail.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> In one of my projects I built some code that depends in a nice and
>>>>>> generic way on the methods of rv_continuous in scipy.stats. Now it
>>>>>> turns out that I shot myself in the foot because I need to run a test
>>>>>> with the sum (convolution) of three such distributions. As far as I
>>>>>> can see there is no standard way to achieve this in scipy.stats. Does
>>>>>> anybody know of a good (generic) way to do this? If not, would it
>>>>>> actually be useful to add such functionality to scipy.stats, if
>>>>>> possible at all?
>>>>>>
>>>>>> After some struggling I wrote the code below, but this is a first
>>>>>> attempt. I am very interested in obtaining feedback to turn this into
>>>>>> something that is useful for a larger population than just me.
>>>>>
>>>>> Sounds fun.
>>>>>
>>>>> The plots are a bit misleading, because matplotlib is doing the
>>>>> interpolation for you
>>>>>
>>>>> pl.plot(grid,D1.cdf(grid), '.')
>>>>> pl.plot(grid,conv.cdf(grid), '.')
>>>>> pl.plot(grid,conv2.cdf(grid), '.')
>>>>>
>>>>> The main problem is the mixing of continuous distribution and discrete
>>>>> grid. pdf, cdf, .... want to evaluate at a point and with floating
>>>>> points it's perilous (as we discussed before).
>>>>>
>>>>> As I read it, neither your cdf nor pdf return specific points.
>>>>>
>>>>> I think the easiest would be to work with linear interpolation
>>>>> interp1d on a fine grid, but I never checked how fast this is for a
>>>>> fine grid.
>>>>> If cdf and pdf are defined with a piecewise polynomial, then they can
>>>>> be evaluated at any points and the generic class, rv_continuous,
>>>>> should be able to handle everything.
>>>>> The alternative would be to work with the finite grid and use the
>>>>> fixed spacing to define a lattice distribution, but that doesn't exist
>>>>> in scipy.
>>>>>
>>>>> I haven't thought about the convolution itself yet, (an alternative
>>>>> would be using fft to work with the characteristic function.)
>>>>>
>>>>> If you use this for a test, how much do you care about having accurate
>>>>> tails, or is effectively truncating the distribution ok?
>>>>
>>>> The other question I thought about in a similar situation is, what is
>>>> the usage or access pattern.
>>>>
>>>> For many cases, I'm not really interested in evaluating the pdf at
>>>> specific points, but over a range or interval of points. In this case
>>>> relying on floating point access doesn't look like the best way to go,
>>>> and I spend more time on the `expect` method. Calculating expectation
>>>> of a function w.r.t the distribution that can use the internal
>>>> representation instead of the generic integrate.quad.
>>>
>>>
>>> a test case to check numerical accuracy of discretized approximation/convolution
>>>
>>> Di = expon(scale = 1./2)
>>> sum of identical exponentially distributed random variables is gamma
>>> http://en.wikipedia.org/wiki/Gamma_distribution
>>>
>>> 1 exponential and corresponding gamma
>>>
>>>>>> convolved.D1.pdf(grid[:10])
>>> array([ 2. , 1.996004 , 1.99201598, 1.98803593, 1.98406383,
>>> 1.98009967, 1.97614343, 1.97219509, 1.96825464, 1.96432206])
>>>>>> stats.gamma.pdf(grid[:10], 1, scale=1/2.)
>>> array([ 2. , 1.996004 , 1.99201598, 1.98803593, 1.98406383,
>>> 1.98009967, 1.97614343, 1.97219509, 1.96825464, 1.96432206])
>>>
>>> sum of 2 exponentials
>>>
>>>>>> stats.gamma.pdf(grid[:10], 2, scale=1/2.)
>>> array([ 0. , 0.00399201, 0.00796806, 0.01192822, 0.01587251,
>>> 0.019801 , 0.02371372, 0.02761073, 0.03149207, 0.0353578 ])
>>>>>> convolved.pdf(np.zeros(10))
>>> array([ 0.004 , 0.00798402, 0.0119521 , 0.01590429, 0.01984064,
>>> 0.0237612 , 0.02766601, 0.03155512, 0.03542858, 0.03928644])
>>>>>> stats.gamma.pdf(grid[1:21], 2, scale=1/2.) - convolved.pdf(np.zeros(20))
>>> array([ -7.99200533e-06, -1.59520746e-05, -2.38803035e-05,
>>> -3.17767875e-05, -3.96416218e-05, -4.74749013e-05,
>>> -5.52767208e-05, -6.30471746e-05, -7.07863571e-05,
>>> -7.84943621e-05, -8.61712832e-05, -9.38172141e-05,
>>> -1.01432248e-04, -1.09016477e-04, -1.16569995e-04,
>>> -1.24092894e-04, -1.31585266e-04, -1.39047203e-04,
>>> -1.46478797e-04, -1.53880139e-04])
>>>
>>> sum of 3 exponentials
>>>
>>>>>> convolved2.conv[:2000:100]
>>> array([ 8.00000000e-06, 3.37382569e-02, 1.08865338e-01,
>>> 1.99552301e-01, 2.89730911e-01, 3.70089661e-01,
>>> 4.35890673e-01, 4.85403437e-01, 5.18794908e-01,
>>> 5.37354948e-01, 5.42966239e-01, 5.37750775e-01,
>>> 5.23842475e-01, 5.03248651e-01, 4.77772987e-01,
>>> 4.48980181e-01, 4.18187929e-01, 3.86476082e-01,
>>> 3.54705854e-01, 3.23544178e-01])
>>>>>> stats.gamma.pdf(grid[1:2001:100], 3, scale=1/2.)
>>> array([ 3.99200799e-06, 3.33407414e-02, 1.08109964e-01,
>>> 1.98494147e-01, 2.88432744e-01, 3.68614464e-01,
>>> 4.34297139e-01, 4.83743524e-01, 5.17112771e-01,
>>> 5.35686765e-01, 5.41340592e-01, 5.36189346e-01,
>>> 5.22360899e-01, 5.01857412e-01, 4.76478297e-01,
>>> 4.47784794e-01, 4.17091870e-01, 3.85477285e-01,
>>> 3.53800704e-01, 3.22727969e-01])
>>>>>> stats.gamma.pdf(grid[1:2001:100], 3, scale=1/2.) - convolved2.conv[:2000:100]
>>> array([ -4.00799201e-06, -3.97515433e-04, -7.55373610e-04,
>>> -1.05815476e-03, -1.29816640e-03, -1.47519705e-03,
>>> -1.59353434e-03, -1.65991307e-03, -1.68213690e-03,
>>> -1.66818281e-03, -1.62564706e-03, -1.56142889e-03,
>>> -1.48157626e-03, -1.39123891e-03, -1.29468978e-03,
>>> -1.19538731e-03, -1.09605963e-03, -9.98797767e-04,
>>> -9.05149579e-04, -8.16209174e-04])
>>>
>>> values shifted by one ?
>>>
>>>>>> np.argmax(stats.gamma.pdf(grid, 3, scale=1/2.))
>>> 1000
>>>
>>>>>> np.argmax(convolved2.conv)
>>> 999
>>
>>
>> on the other hand, the convolution with 3 distribution looks accurate
>> at around 1e-4, so depending on the application this works pretty well
>>
>>>>> stats.gamma.cdf(grid[1:2001:100], 3, scale=1/2.) - convolved2.cdf(np.zeros(2000))[:2000:100]
>> array([ -6.64904892e-09, -3.41833566e-05, -1.12776463e-04,
>> -2.11506843e-04, -3.14716605e-04, -4.12809581e-04,
>> -5.00378449e-04, -5.74846082e-04, -6.35489967e-04,
>> -6.82750381e-04, -7.17747313e-04, -7.41949767e-04,
>> -7.56955276e-04, -7.64348246e-04, -7.65613908e-04,
>> -7.62090849e-04, -7.54949690e-04, -7.45188971e-04,
>> -7.33641860e-04, -7.20989200e-04])
>>
>> for 2 distributions
>>
>>>>> np.max(np.abs(stats.gamma.cdf(grid[1:2001], 2, scale=1/2.) - convolved.cdf(np.zeros(2000))[:2000]))
>> 0.00039327739646649595
>>
>> Josef
>>
>>>
>>> Josef
>>>
>>>>
>>>> Josef
>>>>
>>>>
>>>>>
>>>>> Josef
>>>>>
>>>>>>
>>>>>> Thanks in advance
>>>>>>
>>>>>> Nicky
>>>>>>
>>>>>> import numpy as np
>>>>>> import scipy.stats
>>>>>> from scipy.stats import poisson, uniform, expon
>>>>>> import pylab as pl
>>>>>>
>>>>>> # I need a grid since np.convolve requires two arrays.
>>>>>>
>>>>>> # choose the grid such that it covers the numerically relevant support
>>>>>> # of the distributions
>>>>>> grid = np.arange(0., 5., 0.001)
>>>>>>
>>>>>> # I need P(D1+D2+D3 <= x)
>>>>>> D1 = expon(scale = 1./2)
>>>>>> D2 = expon(scale = 1./3)
>>>>>> D3 = expon(scale = 1./6)
>>>>>>
>>>>>> class convolved_gen(scipy.stats.rv_continuous):
>>>>>> def __init__(self, D1, D2, grid):
>>>>>> self.D1 = D1
>>>>>> self.D2 = D2
>>>>>> delta = grid[1]-grid[0]
>>>>>> p1 = self.D1.pdf(grid)
>>>>>> p2 = self.D2.pdf(grid)*delta
>>>>>> self.conv = np.convolve(p1, p2)
>>>>>>
>>>>>> super(convolved_gen, self).__init__(name = "convolved")
>>>>>>
>>>>>> def _cdf(self, grid):
>>>>>> cdf = np.cumsum(self.conv)
>>>>>> return cdf/cdf[-1] # ensure that cdf[-1] = 1
>>>>>>
>>>>>> def _pdf(self, grid):
>>>>>> return self.conv[:len(grid)]
>>>>>>
>>>>>> def _stats(self):
>>>>>> m = self.D1.stats("m") + self.D2.stats("m")
>>>>>> v = self.D1.stats("v") + self.D2.stats("v")
>>>>>> return m, v, 0., 0.
>>>>>>
>>>>>>
>>>>>>
>>>>>> convolved = convolved_gen(D1, D2, grid)
>>>>>> conv = convolved()
>>>>>>
>>>>>> convolved2 = convolved_gen(conv, D3, grid)
>>>>>> conv2 = convolved2()
>>>>>>
>>>>>> pl.plot(grid,D1.cdf(grid))
>>>>>> pl.plot(grid,conv.cdf(grid))
>>>>>> pl.plot(grid,conv2.cdf(grid))
>>>>>> pl.show()
>>>>>> _______________________________________________
>>>>>> SciPy-User mailing list
>>>>>> SciPy-User@scipy.org
>>>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
More information about the SciPy-User
mailing list