# [Scipy-tickets] [SciPy] #1553: Improve distribution fitting

SciPy Trac scipy-tickets@scipy....
Thu Nov 3 09:43:01 CDT 2011

```#1553: Improve distribution fitting
-------------------------+--------------------------------------------------
 Reporter:  pbrod        |       Owner:  somebody
     Type:  enhancement  |      Status:  new
 Priority:  normal       |   Milestone:  Unscheduled
Component:  scipy.stats  |     Version:  devel
 Keywords:               |
-------------------------+--------------------------------------------------
Currently, the fitting of distribution parameters often fails because of
inappropriate starting values.
For many distributions with compact support, the starting values cause
the negative log-likelihood function (nnlf) to return an infinite value,
because one or more data points fall outside the support of the
distribution for the given parameters. In this situation the nnlf
function does not discriminate between one point and two points outside
the support of the distribution. In other words, the penalty for having
one or more points outside the support is the same. For this reason the
optimizer is often unable to move towards the correct solution.
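To illustrate the effect of the flat penalty, here is a toy sketch (a
hypothetical unit-scale exponential likelihood, not the actual
scipy.stats code): whether one or two points fall below the location
parameter, the objective is the same infinite value, so the optimizer
sees no direction of improvement.

{{{
import numpy as np

def flat_nnlf(loc, data):
    # Toy negative log-likelihood of a unit-scale exponential with
    # support [loc, inf).  Any point below loc makes the value inf,
    # no matter how many points lie outside the support.
    if np.any(data < loc):
        return np.inf
    return np.sum(data - loc)  # = -sum(log(exp(-(x - loc))))

data = np.array([0.5, 1.0, 2.0, 3.0])
# loc = 0.7 leaves one point outside, loc = 1.5 leaves two, yet the
# objective cannot tell these candidates apart:
print(flat_nnlf(0.7, data), flat_nnlf(1.5, data))  # inf inf
}}}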

To improve the parameter fitting, I propose changing the nnlf function
to return 100*log(realmax) for every point outside the support of the
distribution, instead of inf. The following code snippet shows how this
can be done:

{{{
def nnlf(self, theta, x):
    '''Return the negative loglikelihood function, i.e.,
    -sum(log(pdf(x, theta)), axis=0), where theta are the
    parameters (including loc and scale).
    '''
    try:
        loc = theta[-2]
        scale = theta[-1]
        args = tuple(theta[:-2])
    except IndexError:
        raise ValueError("Not enough input arguments.")
    if not self._argcheck(*args) or scale <= 0:
        return inf
    x = arr((x - loc) / scale)
    cond0 = (x <= self.a) | (self.b <= x)
    N = len(x)
    if any(cond0):
        # old behaviour: return inf
        Nbad = sum(cond0)  # number of points outside the support
        goodargs = argsreduce(1 - cond0, *((x,)))
        goodargs = tuple(goodargs) + args
        xmax = floatinfo.machar.xmax
        return (self._nnlf(*goodargs) + N * log(scale)
                + Nbad * 100.0 * log(xmax))
    else:
        return self._nnlf(x, *args) + N * log(scale)
}}}
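As a standalone sketch of why the graded penalty helps (again using a
hypothetical unit-scale exponential likelihood, not the actual
scipy.stats machinery): each point outside the support now adds a large
but finite cost, so the objective decreases monotonically as fewer
points are left outside, and a generic optimizer can walk back into the
feasible region.

{{{
import numpy as np

def penalized_nnlf(loc, data):
    # Toy negative log-likelihood of a unit-scale exponential with
    # support [loc, inf), using the proposed graded penalty: every
    # point outside the support adds 100*log(xmax) instead of inf.
    xmax = np.finfo(float).max
    bad = data < loc
    good = data[~bad]
    return np.sum(good - loc) + bad.sum() * 100.0 * np.log(xmax)

data = np.array([0.5, 1.0, 2.0, 3.0])
# Two violations cost more than one, and one more than none, so the
# objective now points the optimizer towards the feasible region:
assert penalized_nnlf(1.5, data) > penalized_nnlf(0.7, data) > penalized_nnlf(0.4, data)
}}}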

Here is one example from scipy version 0.9:
{{{
>>> from scipy.stats import genpareto
>>> numargs = genpareto.numargs
>>> [ c ] = [0.9,] * numargs
>>> rv = genpareto(c)
>>> R = genpareto.rvs(c, size=100)
>>> # The start value loc=1 causes the fitting to stop immediately:
>>> par1 = genpareto.fit(R, 0.9, loc=1, scale=1); par1
(0.90000000000000002, 1.0, 1.0)
>>> par0 = genpareto.fit(R, 0.9, loc=0, scale=1); par0
(0.76976497880196693, 0.0012892534460688344, 0.74129858543418492)
>>> genpareto.nnlf(par1,R)
inf
>>> genpareto.nnlf(par0,R)
147.04624611814594
}}}

Here is the result with the proposed enhancement applied to the same
data:

{{{
>>> # The start value loc=1 no longer stops the fitting, although the
>>> # solution found is not optimal:
>>> par1 = genpareto.fit(R, 0.9, loc=1, scale=1); par1
(0.57335616295273795, 0.0012892554988262234, 1.1866184851484118)
>>> par0 = genpareto.fit(R, 0.9, loc=0, scale=1); par0
(0.76976497880196693, 0.0012892534460688344, 0.74129858543418492)
>>> genpareto.nnlf(par1,R)
150.48509762825049
>>> genpareto.nnlf(par0,R)
147.04624611814594
}}}

--
Ticket URL: <http://projects.scipy.org/scipy/ticket/1553>
SciPy <http://www.scipy.org>
SciPy is open-source software for mathematics, science, and engineering.
```