# [SciPy-user] Fitting an arbitrary distribution

josef.pktd@gmai... josef.pktd@gmai...
Fri May 22 09:40:00 CDT 2009

```On Fri, May 22, 2009 at 12:46 AM, David Warde-Farley <dwf@cs.toronto.edu> wrote:
> On 21-May-09, at 11:47 PM, David Baddeley wrote:
>
>> Thanks for the prompt replies!
>>
>> I guess what I was meaning was that the PDF / histogram was the sum
>> or multiple Gaussians/normal distibutions. Sorry about the
>> ambiguity. I've had a quick look at the Em package and mixture
>> models, and while my problem is similar they might be a little more
>> general.
>>
>> I guess I should describe the problem in a bit more detail - I'm
>> measuring the length of an objects which can be built up from
>> multiple unit cells. The measured size distribution is thus
>> multimodal, and I want to extract both the unit size and the
>> fraction of objects having each number of unit cells. This makes the
>> problem much more constrained than what is dealt with in the Em
>> package.
>
> So there is exactly one kind of 'unit cell' and then different lengths
> of objects? Are your observations expected to be particularly noisy?
>
> What you're staring down isn't quite a standard mixture model, your
> 'hidden variable' is just this unit size.
>
> Depending on the scale of your problem PyMC would work very well here.
> You'd have one Stochastic for the unit size, another for the integer
> multiple associated with each quantity, a Deterministic that
> multiplies the two and either make the Deterministic observed or add
> another Stochastic with the Deterministic as its mean if you believe
> your observations are noisy with a certain noise distribution.
>
> Note that since these things you are measuring are supposedly discrete
> multiples of the unit size a Gaussian distribution isn't appropriate
> for the multiples. Something like a Poisson would make more sense.
>
> Then you'd just fit the model and look at the posterior distributions
> over each quantity of interest, taking the maximum a posteriori
> (posterior mean) estimate where appropriate.  To determine the
> fraction of the population that have a given number of unit cells, you
> basically just count (you'd have an estimate for each observation of
> how many unit cells it has).
>
> You could also do this by EM, but pyem would not be suitable, as it's
> built specifically for the case of vanilla Gaussian mixtures. PyMC
> would be a ready-made solution which would give you the additional
> flexibility of inferring a distribution over all estimated parameters
> rather than just a point-estimate.
>

I also think that pymc offers the best and well tested way of doing this.

Just to show how it is possible to do it with stats distributions, I
attach a script where I quickly hacked together different pieces to
try out the maximum likelihood estimation for this case. It's not
cleaned up and has pieces left over from copy and paste, but it's a
proof of concept for how univariate mixture distributions can be
constructed as subclasses of rv_continuous. But to be really useful,
several parts need to be improved.

Josef
-------------- next part --------------
# -*- coding: utf-8 -*-
"""
univariate_mixture.py

Notes
-----
* estimates a mixture of normal distribution given by random sum of iid normal
distributions
* only works for good initial conditions and if variance of individual
normal distribution is not too large
* maximum likelihood estimation using generic fit method is not very precise
and makes interpretation of parameters more difficult
* maximum likelihood estimation with fixed location and scale done in a quick
hack, gives good results for the "nice" estimation problem (good initial
conditions and small variance)
* to use generic methods, only pdf would need to be specified
* restriction: hard coded number of mixtures is four
* to be useful would need AIC, BIC, and covariance matrix of maximum
likelihood estimates
* does not impose non-negativity of the mixture probabilities, could use
logit transformation

* this pattern might work well for arbitrary distributions, when the
estimation problem is nice, e.g. mixture of only 2 distributions or well
separated distributions
* need proper implementation of estimation with frozen parameters like
location and scale

#see: for application of frozen parameter in fit
#http://mail.scipy.org/pipermail/scipy-user/2009-February/019968.html

License: same as scipy
"""

import numpy as np
from scipy import stats, special, optimize
from scipy.stats import distributions

class normmix_gen(distributions.rv_continuous):
def _rvs(self, mu, sig, p1, p2, p3):
return np.hstack((
mu+    sig*stats.norm.rvs(size=p1*self._size),
2.0*mu+2.0*sig*stats.norm.rvs(size=p2*self._size),
3.0*mu+3.0*sig*stats.norm.rvs(size=p3*self._size),
4.0*mu+4.0*sig*stats.norm.rvs(size=round((1-p1-p2-p3)*self._size))))
def _pdf(self, x, mu, sig, p1, p2, p3):
return p1*stats.norm.pdf(x,loc=mu, scale=sig) + \
p2*stats.norm.pdf(x,loc=2.0*mu, scale=2.0*sig) +\
p3*stats.norm.pdf(x,loc=3.0*mu, scale=3.0*sig) +\
(1-p1-p2-p3)*stats.norm.pdf(x,loc=4.0*mu, scale=4.0*sig)
def _nnlf_(self, x, *args):   # inherited version for comparison
#print 'args in nnlf_', args
return -np.sum(np.log(self._pdf(x, *args)),axis=0)

def nnlf_fix(self, theta, x):
# quick hack to remove loc and scale, removed also bound checking
#
# - sum (log pdf(x, theta),axis=0)
#   where theta are the parameters (including loc and scale)
#
#        try:
#            loc = theta[-2]
#            scale = theta[-1]
#            args = tuple(theta[:-2])
#        except IndexError:
#            raise ValueError, "Not enough input arguments."
#        if not self._argcheck(*args) or scale <= 0:
#            return inf
#        x = arr((x-loc) / scale)
#        cond0 = (x <= self.a) | (x >= self.b)
#        if (any(cond0)):
#            return inf
#        else:
#            N = len(x)
args = tuple(theta)
#print args
return self._nnlf_(x, *args)# + N*log(scale)

def fit_fix(self, data, *args, **kwds):
'''stolen from frozen distribution estimation and partial
removal of loc and scale'''
loc0, scale0 = map(kwds.get, ['loc', 'scale'],[0.0, 1.0])
Narg = len(args)

if Narg == 0 and hasattr(self, '_fitstart'):
x0 = self._fitstart(data)
elif Narg > self.numargs:
raise ValueError, "Too many input arguments."
else:
#args += (1.0,)*(self.numargs-Narg)
# location and scale are at the end
x0 = args# + (loc0, scale0)

if 'frozen' in kwds:
frmask = np.array(kwds['frozen'])
if len(frmask) != self.numargs+2:
raise ValueError, "Incorrect number of frozen arguments."
else:
# keep starting values for not frozen parameters
x0  = np.array(x0)[np.isnan(frmask)]
else:
frmask = None

#print x0
#print frmask
return optimize.fmin(self.nnlf_fix, x0,
args=(np.ravel(data), ), disp=0)

normmix = normmix_gen(a=0.0,name='normmix',longname='A normmix',
shapes=('mu, sig, p1, p2, p3'),
extradoc="""normmix""")

true = (1.,0.05,0.4,0.2,0.2)

rvs = normmix.rvs(size=1000,*true)
#rvs = normmix.rvs(1.,0.01,0.4,0.2,0.2, size=1000)
#rvs = normmix.rvs(1.,0.05,0.5,0.49,0.01, size=1000)
est = normmix.fit(rvs,1.,0.005,0.4,0.2,0.2,loc=0,scale=1)
startval = np.array((1.,0.005,0.4,0.2,0.2))*1.1
#est2 = normmix.fit_fix(rvs,*startval)
est2 = normmix.fit_fix(1.0*rvs,1.05,0.005,0.3,0.21,0.21)
print 'estimate with generic fit'
print est
print 'estimate with fixed loc scale'
print est2

import matplotlib.pyplot as plt
#x = rvs
mu, sig, p1, p2, p3 = true
plt.figure()
c,b,d=plt.hist(rvs,bins=50,normed=True)
plt.figure()
plt.plot(b,p1*stats.norm.pdf(b,loc=mu, scale=sig),
b,p2*stats.norm.pdf(b,loc=2.0*mu, scale=2.0*sig),
b,p3*stats.norm.pdf(b,loc=3.0*mu, scale=3.0*sig))
plt.figure()
plt.plot(b,normmix.pdf(b,1,0.05,0.4,0.2,0.2))
#plt.show()
plt.figure()
plt.plot(b,normmix.pdf(b,*est))
plt.title('estimated pdf, generic fit')
#plt.show()
plt.figure()
plt.plot(b,normmix.pdf(b,*est2))
plt.title('estimated pdf, loc,scale fixed')
#plt.show()

print np.array(est)[:-2]-true
print np.array(est2)-true
print 'estimated mean unit size', est2[0]
print 'estimated standard deviation of unit size', est2[1]
```

More information about the SciPy-user mailing list