[SciPy-User] Can I "fix" scale=1 when fitting distributions?

josef.pktd@gmai... josef.pktd@gmai...
Fri Apr 16 01:24:27 CDT 2010


On Thu, Apr 15, 2010 at 11:34 AM,  <josef.pktd@gmail.com> wrote:
> On Wed, Apr 14, 2010 at 12:56 AM,  <josef.pktd@gmail.com> wrote:
>> On Wed, Apr 14, 2010 at 12:52 AM,  <josef.pktd@gmail.com> wrote:
>>> On Wed, Apr 14, 2010 at 12:09 AM, David Ho <itsdho@ucla.edu> wrote:
>>>> Actually, being able to "fix" scale=1 would be very useful for me.
>>>>
>>>> The reason I'm trying to fit a von mises distribution to my data is to find
>>>> "kappa", a measure of the concentration of the data.
>>>>
>>>> On wikipedia, I see that the von mises distribution only really has 2
>>>> parameters: mu (the mean), and kappa (1/kappa being analogous to the
>>>> variance).
>>>> When I use vonmises.fit(), though, I get 3 parameters: kappa, mu, and a
>>>> "scale parameter", respectively.
>>>> However, I don't think the scale parameter for the von mises distribution is
>>>> really independent of kappa, is it? (Am I understanding this correctly?)
>>>> (Using the normal distribution as a comparison, I think the "scale
>>>> parameter" for a normal distribution certainly isn't independent of sigma,
>>>> and norm.fit() only returns mu and sigma. This makes a lot more sense to
>>>> me.)
>>>
>>> for normal  distribution loc=mean and scale  = sqrt(variance)  and has
>>> no shape parameter.
>>>
>>> essentially all distributions are transformed  y = (x-mean)/scale
>>> where x is the original data, and y is the standardized data, the
>>> actual _pdf, _cdf, ... are for the standard version of the
>>> distribution,
>>> it's a generic framework, but doesn't make sense in all cases or
>>> applications mainly when we want to fix the support of the
>>> distribution
>>>
>>>>
>>>> I'm basically trying to use kappa as a measure of the "width" of the
>>>> distribution, so the extra degree of freedom introduced by the scale
>>>> parameter is problematic for me. For two distributions that are
>>>> superficially very similar in "width", I might get wildly varying values of
>>>> kappa.
>>>>
>>>> For example, I once got fitted values of kappa=1, and kappa=400 for two
>>>> distributions that looked very similar in width.
>>>> I thought I must have been doing something wrong... until I saw the scale
>>>> parameters were around 1 and 19, respectively.
>>>> Plotting the fitted distributions:
>>>>>>> xs = numpy.linspace(-numpy.pi, numpy.pi, 361)
>>>>>>> plot(xs, scipy.stats.distributions.vonmises.pdf(xs, 1, loc=0, scale=1))
>>>>>>> plot(xs, scipy.stats.distributions.vonmises.pdf(xs, 400, loc=0,
>>>>>>> scale=19))
>>>
>>> one idea is to check the implied moments, eg. vonmises.stats(1, loc=0,
>>> scale=1, moment='mvsk'))
>>> to see whether the implied mean, variance, skew, kurtosis are similar
>>> in two different parameterizations.
>>> Warning skew, kurtosis is incorrect for a few distributions, I don't
>>> know about vonmises.
>>>
>>>>
>>>> What I'd really like to do is "force" scale=1 when I perform the fitting.
>>>> (In fact, it would be nice to even force the mean of the fitted distribution
>>>> to a precalculated value, as well. I really only want one degree of freedom
>>>> for the fitting algorithm -- I just want it to explore different values of
>>>> kappa.)
>>>>
>>>> Is there any way to do this?
>>>
>>> It's not yet possible in scipy, but easy to write, essentially copy
>>> the existing fit and nnlf methods and fix loc and scale and call the
>>> optimization with only one argument.
>>>
>>> I have a more than one year old ticket:
>>> http://projects.scipy.org/scipy/ticket/832
>>>
>>> It might be possible that it works immediately if you monkey patch
>>> rv_continuous  by adding
>>> nnlf_fr and fit_fr
>>> in
>>> http://mail.scipy.org/pipermail/scipy-user/2009-February/019968.html
>>>
>>> scipy.stats.rv_continuous.nnlf_fr = nnlfr
>>> scipy.stats.rv_continuous.fit_fr = fit_fr
>>
>> correction
>> this should instead refer to
>> scipy.stats.distributions.rv_continuous
>
> It looks like it works, checked only on a few examples, see attachment

An updated version:
I added a quickly written Bootstrap and MonteCarlo check to the examples.
looks pretty good overall.

It's a bit slow with 5000 bootstrap/montecarlo replications.

results for 1 parameter vonmises:

Bootstrap
true parameter value
1.5
MLE estimate of parameters using sample (nobs=500)
[ 1.46689453]
simple bootstrap distribution of parameter estimate (nrepl=5000)
mean = 1.470008, bias=-0.029992
median 1.46650390625
var and std 0.0082988915425 0.0910982521375
mse, rmse 0.00919842753983 0.0959084331007
bootstrap confidence interval (90% coverage)
1.328515625 1.62685546875
bootstrap confidence interval (90% coverage) normal approximation
1.32016444394 1.61985102481
Kolmogorov-Smirnov test for normality of bootstrap distribution
 - estimated parameters, p-values not really correct
(0.018543534773093451, 0.064215366586253375)

MonteCarlo     (replace bootstrap by montecarlo in text)
true parameter value
1.5
MLE estimate of parameters using sample (nobs=500)
[ 1.46689453]
simple bootstrap distribution of parameter estimate (nrepl=5000)
mean = 1.503128, bias=0.003128
median 1.50078125
var and std 0.00797727984433 0.089315619263
mse, rmse 0.00798706268883 0.0893703680692
bootstrap confidence interval (90% coverage)
1.3638671875 1.654296875
bootstrap confidence interval (90% coverage) normal approximation
1.35621663362 1.65003887419
Kolmogorov-Smirnov test for normality of bootstrap distribution
 - estimated parameters, p-values not really correct
(0.02053833535547045, 0.029449209724884552)
>>>

Josef




>
> Josef
>
>>
>> Josef
>>>
>>> and call
>>> vonmises.fit_fr(data, frozen=[np.nan, 0.0, 1.0])
>>>
>>> It's worth a try, but if you only need vonmises it might also be easy
>>> to use scipy.optimize.fmin on a function that wraps vonmises._nnlf and
>>> fixes loc, scale.
>>> To give you an idea, I didn't check what the correct function arguments are ::
>>>
>>> def mynnlf(kappa):
>>>    loc = 0
>>>    scale = 1
>>>    return vonmises.nnlf(kappa, loc, scale)
>>>
>>> scipy.optimize.fmin(mynnlf, startingvalueforkappa)
>>>
>>> I could check this for your case at the end of the week.
>>>
>>> If you have an opinion for a good generic API for fit_semifrozen, let me know.
>>> I hope this helps, and please keep me informed. I would like to make
>>> at least a cookbook recipe out of it.
>>>
>>> Josef
>>>
>>>
>>>> Thanks for your help,
>>>>
>>>> --David Ho
>>>>
>>>>
>>>>> On Tue, Apr 13, 2010 at 3:13 PM,  <josef.pktd@gmail.com> wrote:
>>>>> > In the current fit version, loc (and with it the support of the
>>>>> > distribution) and scale are always estimated. In some cases this is
>>>>> > not desired.
>>>>> > You are transforming the original data to fit into the standard
>>>>> > distribution with loc=0, scale=1 . Do you get reasonable estimates for
>>>>> > loc and scale in this case?
>>>>> > If not, then there is another patch or enhanced fit function that
>>>>> > could take loc and scale as fixed.
>>>>> >
>>>>> > I will look at some details in your function later, especially I'm
>>>>> > curious how the circular statistics works.
>>>>> >
>>>>> > Thanks for the example, maybe I can stop ignoring vonmises.
>>>>> >
>>>>> > Josef
>>>>
>>>> _______________________________________________
>>>> SciPy-User mailing list
>>>> SciPy-User@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>>
>>>>
>>>
>>
>
-------------- next part --------------
'''patching scipy to fit distributions with some fixed/frozen parameters

'''

#import numpy
import numpy as np
#import matplotlib.pyplot as mp
#import scipy.stats
from scipy import stats, optimize


########## patching scipy

stats.distributions.vonmises.a = -np.pi
stats.distributions.vonmises.b = np.pi



def _fitstart(self, x):
    # method of moment estimator as starting values, not verified
    # with literature
    loc = np.min([x.min(),0])
    a = 4/stats.skew(x)**2
    scale = np.std(x) / np.sqrt(a)
    return (a, loc, scale)

def nnlf_fr(self, thetash, x, frmask):
    # new frozen version
    # - sum (log pdf(x, theta),axis=0)
    #   where theta are the parameters (including loc and scale)
    #
    try:
        if frmask != None:
            theta = frmask.copy()
            theta[np.isnan(frmask)] = thetash
        else:
            theta = thetash
        loc = theta[-2]
        scale = theta[-1]
        args = tuple(theta[:-2])
    except IndexError:
        raise ValueError, "Not enough input arguments."
    if not self._argcheck(*args) or scale <= 0:
        return np.inf
    x = np.array((x-loc) / scale)
    cond0 = (x <= self.a) | (x >= self.b)
    if (np.any(cond0)):
        return np.inf
    else:
        N = len(x)
        #raise ValueError
        return self._nnlf(x, *args) + N*np.log(scale)

def fit_fr(self, data, *args, **kwds):
    loc0, scale0 = map(kwds.get, ['loc', 'scale'],[0.0, 1.0])
    Narg = len(args)
    
    if Narg == 0 and hasattr(self, '_fitstart'):
        x0 = self._fitstart(data)
    elif Narg > self.numargs:
        raise ValueError, "Too many input arguments."
    else:
        args += (1.0,)*(self.numargs-Narg)
        # location and scale are at the end
        x0 = args + (loc0, scale0)
    
    if 'frozen' in kwds:
        frmask = np.array(kwds['frozen'])
        if len(frmask) != self.numargs+2:
            raise ValueError, "Incorrect number of frozen arguments."
        else:
            # keep starting values for not frozen parameters
            x0  = np.array(x0)[np.isnan(frmask)]
    else:
        frmask = None
        
    #print x0
    #print frmask
    return optimize.fmin(self.nnlf_fr, x0,
                args=(np.ravel(data), frmask), disp=0)

stats.distributions.rv_continuous.fit_fr = fit_fr
stats.distributions.rv_continuous.nnlf_fr = nnlf_fr

########## end patching scipy

for nobs in []:#[20000, 1000, 100]:
    x = stats.vonmises.rvs(1.23, loc=0, scale=1, size=nobs)
    print '\nnobs:', nobs
    print 'true parameter'
    print '1.23, loc=0, scale=1'
    print 'unconstraint'
    print stats.vonmises.fit(x)
    print stats.vonmises.fit_fr(x, frozen=[np.nan, np.nan, np.nan])
    print 'with fixed loc and scale'
    print stats.vonmises.fit_fr(x, frozen=[np.nan, 0.0, 1.0])


distr = stats.gamma
arg, loc, scale = 2.5, 0., 20.

for nobs in []:#[20000, 1000, 100]:
    x = distr.rvs(arg, loc=loc, scale=scale, size=nobs)
    print '\nnobs:', nobs
    print 'true parameter'
    print '%f, loc=%f, scale=%f' % (arg, loc, scale)
    print 'unconstraint'
    print distr.fit(x)
    print distr.fit_fr(x, frozen=[np.nan, np.nan, np.nan])
    print 'with fixed loc and scale'
    print distr.fit_fr(x, frozen=[np.nan, 0.0, 1.0])
    print 'with fixed loc'
    print distr.fit_fr(x, frozen=[np.nan, 0.0, np.nan])


def distfitbootstrap(sample, distr, nrepl=100):
    nobs = len(sample)
    res = np.zeros(nrepl)
    for ii in xrange(nrepl):
        rvsind = np.random.randint(nobs, size=nobs)
        x = sample[rvsind]
        res[ii] = distr.fit_fr(x, frozen=[np.nan, 0.0, 1.0])
    return res

def distfitmc(sample, distr, nrepl=100, distkwds={}):
    arg = distkwds.pop('arg')
    nobs = len(sample)
    res = np.zeros(nrepl)
    for ii in xrange(nrepl):
        x = distr.rvs(arg, size=nobs, **distkwds)
        res[ii] = distr.fit_fr(x, frozen=[np.nan, 0.0, 1.0])
    return res

def printresults(sample, arg, bres):
    print 'true parameter value'
    print arg
    print 'MLE estimate of parameters using sample (nobs=%d)'% (nobs)
    print distr.fit_fr(sample, frozen=[np.nan, 0.0, 1.0])
    print 'simple bootstrap distribution of parameter estimate (nrepl=%d)'% (nrepl)
    print 'mean = %f, bias=%f' % (bres.mean(0), bres.mean(0)-arg)
    print 'median', np.median(bres, axis=0)
    print 'var and std', bres.var(0), np.sqrt(bres.var(0))
    bmse = ((bres - arg)**2).mean(0)
    print 'mse, rmse', bmse, np.sqrt(bmse)
    bressorted = np.sort(bres)
    print 'bootstrap confidence interval (90% coverage)'
    print bressorted[np.floor(nrepl*0.05)], bressorted[np.floor(nrepl*0.95)]
    print 'bootstrap confidence interval (90% coverage) normal approximation'
    print stats.norm.ppf(0.05, loc=bres.mean(), scale=bres.std()),
    print stats.norm.isf(0.05, loc=bres.mean(), scale=bres.std())
    print 'Kolmogorov-Smirnov test for normality of bootstrap distribution'
    print ' - estimated parameters, p-values not really correct'
    print stats.kstest(bres, 'norm', (bres.mean(), bres.std()))

ex = ['gamma', 'vonmises'][1]

if ex == 'gamma':
    distr = stats.gamma
    arg, loc, scale = 2.5, 0., 1
elif ex == 'vonmises':
    distr = stats.vonmises
    arg, loc, scale = 1.5, 0., 1
else:
    raise ValueError('wrong example')

nobs = 500
nrepl = 5000
sample = distr.rvs(arg, loc=loc, scale=scale, size=nobs)

print 'Bootstrap'
bres = distfitbootstrap(sample, distr, nrepl=nrepl )
printresults(sample, arg, bres)

print '\nMonteCarlo'
mcres = distfitmc(sample, distr, nrepl=nrepl,
                        distkwds=dict(arg=arg, loc=loc, scale=scale) )
printresults(sample, arg, mcres)


More information about the SciPy-User mailing list