[SciPy-user] Fitting an arbitrary distribution

josef.pktd@gmai... josef.pktd@gmai...
Fri May 22 09:40:00 CDT 2009

On Fri, May 22, 2009 at 12:46 AM, David Warde-Farley <dwf@cs.toronto.edu> wrote:
> On 21-May-09, at 11:47 PM, David Baddeley wrote:
>> Thanks for the prompt replies!
>> I guess what I was meaning was that the PDF / histogram was the sum
>> of multiple Gaussian/normal distributions. Sorry about the
>> ambiguity. I've had a quick look at the Em package and mixture
>> models, and while my problem is similar they might be a little more
>> general.
>> I guess I should describe the problem in a bit more detail - I'm
>> measuring the lengths of objects which can be built up from
>> multiple unit cells. The measured size distribution is thus
>> multimodal, and I want to extract both the unit size and the
>> fraction of objects having each number of unit cells. This makes the
>> problem much more constrained than what is dealt with in the Em
>> package.
> So there is exactly one kind of 'unit cell' and then different lengths
> of objects? Are your observations expected to be particularly noisy?
> What you're staring down isn't quite a standard mixture model: your
> 'hidden variable' is just this unit size.
> Depending on the scale of your problem, PyMC would work very well here.
> You'd have one Stochastic for the unit size, another for the integer
> multiple associated with each quantity, a Deterministic that
> multiplies the two and either make the Deterministic observed or add
> another Stochastic with the Deterministic as its mean if you believe
> your observations are noisy with a certain noise distribution.
> Note that since these things you are measuring are supposedly discrete
> multiples of the unit size a Gaussian distribution isn't appropriate
> for the multiples. Something like a Poisson would make more sense.
> Then you'd just fit the model and look at the posterior distributions
> over each quantity of interest, taking the maximum a posteriori or
> posterior mean estimate where appropriate.  To determine the
> fraction of the population that have a given number of unit cells, you
> basically just count (you'd have an estimate for each observation of
> how many unit cells it has).
> You could also do this by EM, but pyem would not be suitable, as it's
> built specifically for the case of vanilla Gaussian mixtures. PyMC
> would be a ready-made solution which would give you the additional
> flexibility of inferring a distribution over all estimated parameters
> rather than just a point-estimate.
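The generative model described above (a unit size times an integer multiple,
plus observation noise) can be written down directly; a minimal numpy
simulation of it, with parameter values made up purely for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)

# hypothetical parameter values, for illustration only
unit_size = 1.0    # length of the single 'unit cell'
rate = 2.0         # mean of the Poisson part of the cell count
noise_sd = 0.05    # standard deviation of the measurement noise
n = 1000

# integer multiple per object: 1 + Poisson, so every object has >= 1 cell
n_cells = 1 + rng.poisson(rate, size=n)
# observed length = deterministic product plus Gaussian observation noise
lengths = unit_size * n_cells + rng.normal(0.0, noise_sd, size=n)
```

In a PyMC model the same three pieces would become a Stochastic for
`unit_size`, a Poisson-type Stochastic for `n_cells`, and an observed
Stochastic for `lengths` whose mean is their product.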

I also think that pymc offers the best and well tested way of doing this.

Just to show how it is possible to do this with the stats distributions, I
attach a script in which I quickly hacked together different pieces to
try out maximum likelihood estimation for this case. It's not
cleaned up and has pieces left over from copy and paste, but it's a
proof of concept for how univariate mixture distributions can be
constructed as subclasses of rv_continuous. To be really useful,
though, several parts would need to be improved.
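The core of that pattern, in isolation: a minimal two-component sketch
(class and parameter names here are made up, not taken from the attached
script) where only _pdf is overridden and the generic rv_continuous
machinery supplies cdf, rvs, and moments numerically.

```python
import numpy as np
from scipy import stats

class twomix_gen(stats.rv_continuous):
    # equal-variance mixture of N(0, 1) and N(d, 1), weight p on the first
    def _pdf(self, x, d, p):
        return p * stats.norm.pdf(x) + (1 - p) * stats.norm.pdf(x, loc=d)

# shape parameters d and p are inferred from the _pdf signature
twomix = twomix_gen(name='twomix')

val = twomix.pdf(0.0, 4.0, 0.5)   # ~ 0.5 * norm.pdf(0), far from 2nd mode
mid = twomix.cdf(2.0, 4.0, 0.5)   # midpoint between the two components
```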

-------------- next part --------------
# -*- coding: utf-8 -*-

"""estimate a mixture of normal distributions given by a random sum of iid normals

* only works for good initial conditions and if the variance of the
  individual normal distributions is not too large
* maximum likelihood estimation using the generic fit method is not very
  precise and makes interpretation of the parameters more difficult
* maximum likelihood estimation with fixed location and scale is done in a
  quick hack; it gives good results for the "nice" estimation problem (good
  initial conditions and small variance)
* to use the generic methods, only the pdf needs to be specified
* restriction: the number of mixture components is hard coded to four
* to be really useful, this would need AIC, BIC, and the covariance matrix
  of the maximum likelihood estimates
* does not impose non-negativity of the mixture probabilities; could use a
  logit transformation

* this pattern might work well for arbitrary distributions when the
  estimation problem is nice, e.g. a mixture of only 2 distributions or
  well separated distributions
* needs a proper implementation of estimation with frozen parameters such
  as location and scale

#see: for application of frozen parameters in fit

License: same as scipy
"""

import numpy as np
from scipy import stats, special, optimize
from scipy.stats import distributions

class normmix_gen(distributions.rv_continuous):
    def _rvs(self, mu, sig, p1, p2, p3):
        # draw from the four components in proportions p1, p2, p3, 1-p1-p2-p3
        # (reconstructed to match _pdf below; component sizes cast to int)
        n1, n2, n3 = int(p1*self._size), int(p2*self._size), int(p3*self._size)
        return np.hstack((
                       mu+    sig*stats.norm.rvs(size=n1),
                   2.0*mu+2.0*sig*stats.norm.rvs(size=n2),
                   3.0*mu+3.0*sig*stats.norm.rvs(size=n3),
                   4.0*mu+4.0*sig*stats.norm.rvs(size=self._size-n1-n2-n3)))
    def _pdf(self, x, mu, sig, p1, p2, p3):
        return p1*stats.norm.pdf(x,loc=mu, scale=sig) + \
               p2*stats.norm.pdf(x,loc=2.0*mu, scale=2.0*sig) +\
               p3*stats.norm.pdf(x,loc=3.0*mu, scale=3.0*sig) +\
               (1-p1-p2-p3)*stats.norm.pdf(x,loc=4.0*mu, scale=4.0*sig)
    def _nnlf_(self, x, *args):   # inherited version for comparison
        #print 'args in nnlf_', args
        return -np.sum(np.log(self._pdf(x, *args)),axis=0)
    def nnlf_fix(self, theta, x):
        # quick hack to remove loc and scale, removed also bound checking
        # - sum (log pdf(x, theta),axis=0)
        #   where theta are the parameters (including loc and scale)
#        try:
#            loc = theta[-2]
#            scale = theta[-1]
#            args = tuple(theta[:-2])
#        except IndexError:
#            raise ValueError, "Not enough input arguments."
#        if not self._argcheck(*args) or scale <= 0:
#            return inf
#        x = arr((x-loc) / scale)
#        cond0 = (x <= self.a) | (x >= self.b)
#        if (any(cond0)):
#            return inf
#        else:
#            N = len(x)
        args = tuple(theta)
        #print args
        return self._nnlf_(x, *args)# + N*log(scale)
    def fit_fix(self, data, *args, **kwds):
        '''stolen from frozen distribution estimation and partial
        removal of loc and scale'''
        loc0, scale0 = map(kwds.get, ['loc', 'scale'],[0.0, 1.0])
        Narg = len(args)
        if Narg == 0 and hasattr(self, '_fitstart'):
            x0 = self._fitstart(data)
        elif Narg > self.numargs:
            raise ValueError, "Too many input arguments."
        else:
            #args += (1.0,)*(self.numargs-Narg)
            # location and scale are at the end
            x0 = args  # + (loc0, scale0)
        if 'frozen' in kwds:
            frmask = np.array(kwds['frozen'])
            if len(frmask) != self.numargs+2:
                raise ValueError, "Incorrect number of frozen arguments."
            # keep starting values for the parameters that are not frozen
            x0 = np.array(x0)[np.isnan(frmask)]
        else:
            frmask = None
        #print x0
        #print frmask
        return optimize.fmin(self.nnlf_fix, x0,
                    args=(np.ravel(data), ), disp=0)

normmix = normmix_gen(a=0.0, name='normmix', longname='A normmix',
                      shapes='mu, sig, p1, p2, p3')

true = (1.,0.05,0.4,0.2,0.2)

rvs = normmix.rvs(size=1000,*true)
#rvs = normmix.rvs(1.,0.01,0.4,0.2,0.2, size=1000)
#rvs = normmix.rvs(1.,0.05,0.5,0.49,0.01, size=1000)
est = normmix.fit(rvs,1.,0.005,0.4,0.2,0.2,loc=0,scale=1)
startval = np.array((1.,0.005,0.4,0.2,0.2))*1.1
#est2 = normmix.fit_fix(rvs,*startval)
est2 = normmix.fit_fix(1.0*rvs,1.05,0.005,0.3,0.21,0.21)
print 'estimate with generic fit'
print est
print 'estimate with fixed loc scale'
print est2

import matplotlib.pyplot as plt
#x = rvs
mu, sig, p1, p2, p3 = true
# histogram of the sample; the bin edges b provide the plotting grid
counts, b, patches = plt.hist(rvs, bins=50, normed=True)
plt.plot(b, p1*stats.norm.pdf(b,loc=mu, scale=sig),
         b, p2*stats.norm.pdf(b,loc=2.0*mu, scale=2.0*sig),
         b, p3*stats.norm.pdf(b,loc=3.0*mu, scale=3.0*sig),
         b, (1-p1-p2-p3)*stats.norm.pdf(b,loc=4.0*mu, scale=4.0*sig))
plt.title('true component pdfs and sample histogram')

print np.array(est)[:-2]-true
print np.array(est2)-true
print 'estimated mean unit size', est2[0]
print 'estimated standard deviation of unit size', est2[1]
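On the non-negativity point in the script's notes: one standard fix is to
optimize unconstrained logits and map them onto the probability simplex with
a softmax. A hypothetical helper sketching this (not part of the script
above):

```python
import numpy as np

def softmax_weights(z):
    # map 3 unconstrained reals to 4 weights that are positive and sum to 1;
    # the last component's logit is pinned to 0 to remove the redundancy
    z = np.concatenate([np.asarray(z, dtype=float), [0.0]])
    ez = np.exp(z - z.max())   # subtract the max for numerical stability
    return ez / ez.sum()

w = softmax_weights([0.0, 0.0, 0.0])   # equal logits -> four equal weights
```

The optimizer would then search over the logits, and the weights fed into
the mixture pdf are guaranteed valid for any point the search visits.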

More information about the SciPy-user mailing list