[SciPy-Dev] Fitting distributions [Was Re: Warnings raised (from fit in scipy.stats)]

josef.pktd@gmai... josef.pktd@gmai...
Fri Jun 11 16:42:46 CDT 2010


On Fri, Jun 11, 2010 at 5:11 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
> On Fri, Jun 11, 2010 at 2:49 PM, Vincent Davis <vincent@vincentdavis.net> wrote:
>> On Fri, Jun 11, 2010 at 12:20 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
>>> On Fri, Jun 11, 2010 at 2:17 PM, Vincent Davis <vincent@vincentdavis.net> wrote:
>>>> On Fri, Jun 11, 2010 at 11:34 AM, Skipper Seabold <jsseabold@gmail.com> wrote:
>>>>> On Fri, Jun 11, 2010 at 1:07 PM,  <josef.pktd@gmail.com> wrote:
>>>>>> On Fri, Jun 11, 2010 at 12:45 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
>>>>>>> Since the raising of warning behavior has been changed (I believe), I
>>>>>>> have been running into a lot of warnings in my code when say I do
>>>>>>> something like
>>>>>>>
>>>>>>> In [120]: from scipy import stats
>>>>>>>
>>>>>>> In [121]: y = [-45, -3, 1, 0, 1, 3]
>>>>>>>
>>>>>>> In [122]: v = stats.norm.pdf(y)/stats.norm.cdf(y)
>>>>>>> Warning: invalid value encountered in divide
>>>>>>>
>>>>>>> Sometimes, this is useful to know.  Sometimes, though, it's very
>>>>>>> disturbing when it's encountered in some kind of iteration or
>>>>>>> optimization.  I have been using numpy.clip to get around this in my
>>>>>>> own code, but when it's buried a bit deeper, it's not quite so simple.
>>>>>>>
>>>>>>> Take this example.
>>>>>>>
>>>>>>> In [123]: import numpy as np
>>>>>>>
>>>>>>> In [124]: np.random.seed(12345)
>>>>>>>
>>>>>>> In [125]: B = 6.0
>>>>>>>
>>>>>>> In [126]: x = np.random.exponential(scale=B, size=5000)
>>>>>>>
>>>>>>> In [127]: from scipy.stats import expon
>>>>>>>
>>>>>>> In [128]: expon.fit(x)
>>>>>>>
>>>>>>> <dozens of warnings clipped>
>>>>>>>
>>>>>>> Out[128]: (0.21874043533906118, 5.7122829778172939)
>>>>>>>
>>>>>>> The fit is achieved by fmin (as far as I know, since disp=0 in the
>>>>>>> rv_continuous.fit...), but there are a number of warnings emitted.  Is
>>>>>>> there any middle ground to be had in these type of situations via
>>>>>>> context management perhaps?
>>>>>>>
>>>>>>> Should I file a ticket?
>>>>>>
>>>>>> Which numpy scipy versions are you using?
>>>>
>>>> I get know warnings
>>>>>>> import numpy as np
>>>>>>> np.random.seed(12345)
>>>>>>> B = 6.0
>>>>>>> x = np.random.exponential(scale=B, size=5000)
>>>>>>> from scipy.stats import expon
>>>>>>> expon.fit(x)
>>>> array([  6.43573559e-04,   5.93058867e+00])
>>>>
>>>
>>> You also get different values than I do, which shouldn't be the case.
>>>
>>> I just discovered that my expon.fit(x) just returns the first and
>>> second moments of the data (even when I set floc = 0, I still get the
>>> second moment), and I am trying to figure out why.  It looks like
>>> something is amiss.
>>
>
> So maybe I am missing something (quite likely), but the reason that
> the expon.fit(x), (silently) doesn't work in the above is that
> expon.nnlf returns inf for the default start loc.
>
> Should fit_loc_scale be overwritten for expon?
>
> In [60]: expon.fit_loc_scale(x)
> Out[60]: (0.21874043533906118, 5.7122829778172939)
>
> In [61]: expon.nnlf(expon.fit_loc_scale(x),x)
> Out[61]: inf
>
> Changing fmin to disp=1, gives
>
> In[62]: expon.fit(x)
>
> <snip all the warnings, except from the solver>
>
> Warning: Maximum number of function evaluations has been exceeded.
> Out[62]: (0.21874043533906118, 5.7122829778172939)
>
> The default loc is defined (for this case as)
>
> loc = x.mean() - x.std()
>
> so for any x <= loc it is outside of the domain of the exponential
> distribution when it gets centered in nnlf.
>
> But
>
> In [63]: expon.fit(x,loc=0)
> Optimization terminated successfully.
>         Current function value: 13900.441325
>         Iterations: 59
>         Function evaluations: 107
> Out[63]: (0.00064357307755842945, 5.9303783425183241)

I haven't looked yet in details at the new fit method.

But for many distributions, especially those with a finite boundary,
fit works only if the starting values are well chosen. In this case, I
would set the starting value for loc at (x.min() - a little bit),
unless loc is frozen at zero.

I didn't check where the default starting value for expon has been changed.

Josef


>
> Skipper
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>


More information about the SciPy-Dev mailing list