[Scipy-tickets] [SciPy] #1553: Improve distribution fitting

SciPy Trac scipy-tickets@scipy....
Thu Nov 3 09:43:01 CDT 2011


#1553: Improve distribution fitting
-------------------------+--------------------------------------------------
 Reporter:  pbrod        |       Owner:  somebody   
     Type:  enhancement  |      Status:  new        
 Priority:  normal       |   Milestone:  Unscheduled
Component:  scipy.stats  |     Version:  devel      
 Keywords:               |  
-------------------------+--------------------------------------------------
 Currently, the fitting of distribution parameters often fails due to
 inappropriate starting values. For many distributions with compact
 support, the starting values cause the negative log-likelihood function
 (nnlf) to return an infinite value because one or more data points lie
 outside the support of the distribution for the given parameters. In
 this situation the nnlf does not discriminate between one point or many
 points outside the support: the penalty is the same either way. For this
 reason the optimizer is often unable to move towards the correct
 solution.
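
 To make the problem concrete, here is a minimal standalone sketch (not
 SciPy's code) of such a flat infinite penalty, using a unit-scale
 exponential distribution with a location shift as an assumed toy example.
 The objective returns the same inf whether one point or all points fall
 outside the support, so it gives the optimizer no direction:

 {{{
 import numpy as np

 def nnlf_inf(loc, x):
     """Negative log-likelihood of a unit-scale exponential shifted by loc.
     Returns inf whenever any point lies outside the support [loc, inf)."""
     z = x - loc
     if np.any(z <= 0):
         return np.inf      # same value whether 1 or all points are outside
     return np.sum(z)       # -sum(log(exp(-z))) for the unit exponential

 x = np.array([0.5, 1.0, 2.0, 4.0])
 # One point outside vs. all points outside: the objective cannot tell
 # them apart, so a gradient-free optimizer has nothing to follow.
 print(nnlf_inf(0.7, x))   # one point outside the support
 print(nnlf_inf(5.0, x))   # all points outside the support
 print(nnlf_inf(0.0, x))   # all points inside: finite value
 }}}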

 In order to improve the parameter fitting, I propose changing the nnlf
 function to add a finite penalty of 100*log(realmax) for every point
 outside the support of the distribution, instead of returning inf. The
 following code snippet shows how it can be done:

 {{{
     def nnlf(self, theta, x):
         '''Return negative loglikelihood function, i.e.,
         -sum(log pdf(x, theta), axis=0), where theta are the
         parameters (including loc and scale).
         '''
         try:
             loc = theta[-2]
             scale = theta[-1]
             args = tuple(theta[:-2])
         except IndexError:
             raise ValueError("Not enough input arguments.")
         if not self._argcheck(*args) or scale <= 0:
             return inf
         x = arr((x - loc) / scale)
         cond0 = (x <= self.a) | (self.b <= x)
         if any(cond0):
             # old behaviour: return inf
             goodargs = argsreduce(1 - cond0, *((x,)))
             goodargs = tuple(goodargs + list(args))
             N = len(x)
             Nbad = sum(cond0)
             xmax = floatinfo.machar.xmax
             return (self._nnlf(*goodargs) + N * log(scale)
                     + Nbad * 100.0 * log(xmax))
         else:
             N = len(x)
             return self._nnlf(x, *args) + N * log(scale)
 }}}
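
 For a sense of scale: on IEEE-754 doubles, realmax is about 1.8e308 and
 log(realmax) is about 709.78, so each out-of-support point adds roughly
 7.1e4 to the objective. That is large enough to dominate any realistic
 log-likelihood, yet finite, so the optimizer can still compare candidate
 parameter vectors. A quick check:

 {{{
 import numpy as np

 # realmax on IEEE-754 doubles: np.finfo(float).max ~ 1.798e308
 xmax = np.finfo(float).max
 penalty_per_point = 100.0 * np.log(xmax)
 print(penalty_per_point)  # ~7.1e4: finite, but dwarfs any realistic nnlf
 }}}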

 Here is one example from scipy version 0.9:
 {{{
 >>> from scipy.stats import genpareto
 >>> numargs = genpareto.numargs
 >>> [ c ] = [0.9,] * numargs
 >>> rv = genpareto(c)
 >>> R = genpareto.rvs(c, size=100)
 >>> # start value loc=1 causes the fitting to stop
 >>> par1 = genpareto.fit(R, 0.9, loc=1, scale=1); par1
 (0.90000000000000002, 1.0, 1.0)
 >>> par0 = genpareto.fit(R, 0.9, loc=0, scale=1); par0
 (0.76976497880196693, 0.0012892534460688344, 0.74129858543418492)
 >>> genpareto.nnlf(par1,R)
 inf
 >>> genpareto.nnlf(par0,R)
 147.04624611814594
 }}}

 Here is the result with the proposed enhancement applied to the same
 data:

 {{{
 >>> # start value loc=1 no longer stops the fitting, but the solution
 >>> # is not optimal
 >>> par1 = genpareto.fit(R, 0.9, loc=1, scale=1); par1
 (0.57335616295273795, 0.0012892554988262234, 1.1866184851484118)
 >>> par0 = genpareto.fit(R, 0.9, loc=0, scale=1); par0
 (0.76976497880196693, 0.0012892534460688344, 0.74129858543418492)
 >>> genpareto.nnlf(par1,R)
 150.48509762825049
 >>> genpareto.nnlf(par0,R)
 147.04624611814594
 }}}
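
 Until something like this lands, a practical workaround with the current
 behaviour is to pick a starting loc strictly below the sample minimum, so
 that every point starts inside the support and the nnlf is finite at the
 starting point. A sketch (assuming a current scipy.stats/NumPy API; the
 0.1*std offset is an arbitrary choice):

 {{{
 import numpy as np
 from scipy.stats import genpareto

 rng = np.random.default_rng(0)   # seeded for reproducibility
 R = genpareto.rvs(0.9, size=100, random_state=rng)

 # Start loc strictly below min(R): every point begins inside the support,
 # the nnlf is finite at the starting point, and the optimizer can move.
 loc0 = R.min() - 0.1 * R.std()
 par = genpareto.fit(R, 0.9, loc=loc0, scale=1.0)
 print(par)
 }}}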

-- 
Ticket URL: <http://projects.scipy.org/scipy/ticket/1553>
SciPy <http://www.scipy.org>
SciPy is open-source software for mathematics, science, and engineering.

