[SciPy-Dev] scipy.stats: algorithm to for ticket 1493

josef.pktd@gmai... josef.pktd@gmai...
Wed Apr 25 22:15:33 CDT 2012

On Wed, Apr 25, 2012 at 3:49 PM,  <josef.pktd@gmail.com> wrote:
> On Wed, Apr 25, 2012 at 3:21 PM, nicky van foreest <vanforeest@gmail.com> wrote:
>>>>>>> The difficult cases will be where cdf also doesn't exist and we need
>>>>>>> to get it through integrate.quad, but I don't remember which
>>>>>>> distribution is a good case.
>>>>
>>>> This case is harder indeed. (I assume you mean by 'not exist' that
>>>> there is no closed form expression for the cdf, like the normal
>>>> distribution). Computing the ppf would involve calling quad a lot of
>>>> times. This is wasteful especially since the computation of cdf(b)
>>>> includes the computation of cdf(a) for a < b, supposing that quad runs
>>>> from -np.inf to b. We could repair this by computing cdf(b) = cdf(a) +
>>>> (perhaps I am not clear enough here. If so, let me know.)
>>>
>>> not exists = not defined as _cdf method  could also be scipy.special
>>> if there are no closed form expressions
>>
>> I see, sure.
>>
>>>>>> I just think that we are not able to reach the q=0, q=1 boundaries,
>>>>>> since for some distributions we will run into other numerical
>>>>>> problems. And I'm curious how far we can get with this.
>>>>
>>>> I completely missed to include a test on the obvious cases q >= 1. -
>>>> np.finfo(float).eps and q <= np.finfo(float).eps. It is now in the
>>>> attached file.
>>>
>>>>>> findppf(stats.expon, 1e-30)
>>> -6.3593574850511882e-13
>>
>> This result shows actually that xa and xb are necessary to include in
>> the specification of the distribution. The exponential distribution is
>> (usually) defined only on [0, \infty) not on the negative numbers. The
>> result above is negative though. This is of course a simple
>> consequence of calling brentq. From a user's perspective, though, I
>
> good argument to clean up xa, xb
>
>>
>>> The right answer should be dist.b for q=numerically 1, lower support
>>> point is dist.a but I don't see when we would need it.
>>
>> I agree, provided xa and xb are always properly defined. But then,
>> (just to be nitpicking), the definition of expon does not set xa and
>> xb explicitly. Hence xa = -10, and this is somewhat undesirable, given
>> the negative value above.
>>
>>>>
>>>> The simultaneous updating of left and right in the previous algo is
>>>> wrong. Suppose for instance that cdf(left) < cdf(right) < q. Then both
>>>> left and right would move to the left'. This is clearly wrong. The
>>>> included code should be better.
>>>
>>> would move to the *right* ?
>>
>> Sure.
>>
>>>
>>> I thought the original was a nice trick, we can shift both left and
>>> right since we know it has to be in that direction, the cut of range
>>>
>>> Or do I miss the point?
>>
>> No, you are right. When I wrote this at first, I also thought about
>> the point you bring up here. Then, I was somewhat dissatisfied with
>> calling the while loop twice (suppose the left bound requires
>> updating, then certainly the second while loop (to update the right
>> bound) is unnecessary, and calling cdf(right) is useless). While
>> trying to fix this, I forgot about my initial ideas...
>>
>>>
>>>>
>>>> With regard to the values of xb and xa. Can a ordinary' user change
>>>> these? If so, then the ppf finder should include some protection in my
>>>> opinion. If not, the user will get an error that brentq has not the
>>>> right limits, but this error might be somewhat unexpected. (What has
>>>> brentq to do with finding the ppf?) Of course, looking at the code
>>>> this is clear, but I expect most users will no do so.
>>>
>>> I don't think  ordinary' users should touch xa, xb, but they could.
>>> Except for getting around the limitation in this ticket there is no
>>> reason to change xa, xb, so we could make them private _xa, _xb
>>
>> I think that would be better. Thus, the developer that subclasses
>> rv_continuous should set _xa and _xb properly.
>>
>>>> The code contains two choices about how to handle xa and xb. Do you
>>>> have any preference?
>>>
>>> I don't really like choice 1, because it removes the use of the
>>> predefined xa, xb. On the other hand, with this extension, xa and xb
>>> wouldn't be really necessary anymore.
>>
>> In view of your example with findppf(expon(1e-30)) I prefer to use _xa and _xb.
>>
>>>
>>> another possibility would be to try except brentq with xa, xb first
>>> and get most cases, and switch to your version if needed. I'm not sure
>>> xa, xb are defined well enough that it's worth to go this route,
>>> though.
>>
>> I think that this makes the most sense. The definition of the class
>> should include sensible values of xa and xb.
>>
>> All in all, I would like to make the following proposal to resolve the
>> ticket in a generic way.
>>
>> 1) xa and xb should become private class members _xa and _xb
>> 2) _xa and _xb should be given proper values in the class definition,
>> e.g. expon._xa = 0 and expon._xb = 30., since exp(-30) = 9.35e-14.
>> 3) given a quantile q in the ppf function, include a test on _cdf(_xa)
>> <= q <= _cdf(_xb). If this fails, return an exception with some text
>> that tells that either _cdf(_xa) > q or _cdf(_xb) < q.
>>
>> Given your comments I actually favor all this searching for left and
>> right not that much anymore. It is generic, but it places the
>> responsibility of good code at the wrong place.
>
> 3) I prefer your expanding the search to raising an exception to the
> user. Note also that your 3) is inconsistent with 1). If a user
> visible exception is raised, then the user needs to change xa or xb,
> so it shouldn't be private. That's the current situation (except for a
> more cryptic message).
>
> 2) I'm all in favor, especially for one-side bound distributions,
> where it should be easy to go through those. There might be a few
> where the bound moves with the shape, but the only one I remember is
> genextreme and that has an explicit _ppf
>
> So I would prefer 1), 2) and your new enhanced generic _ppf

forgot to mention

the main reason that I like your expanding search space is that the
shape of the distribution can change a lot. Even if we set xa, xb to
reasonable values for likely shape parameters they won't be good
enough for others, as in the original ticket

>>> stats.invgauss.stats(2)
(array(2.0), array(8.0))
>>> stats.invgauss.stats(7)
(array(7.0), array(343.0))
>>> stats.invgauss.stats(20)
(array(20.0), array(8000.0))
>>> stats.invgauss.stats(100)
(array(100.0), array(1000000.0))
>>> stats.invgauss.cdf(1000, 100)
0.98335562794321207

>>> findppf(stats.invgauss, 0.99, 100)
1926.520850319389
>>> findppf(stats.invgauss, 0.999, 100)
13928.012903371644
>>> findppf(stats.invgauss, 0.999, 1)
8.3548649291400938
---------

to get a rough idea:
for xa, xb and a finite bound either left or right, all have generic
xa=-10 or xb=10

>>> dist_cont = [getattr(stats.distributions, dname)  for dname in dir(stats.distributions) if isinstance(getattr(stats.distributions, dname), stats.distributions.rv_continuous)]

>>> left = [(d.name, d.a, d.xa) for d in dist_cont if not np.isneginf(d.a)]
>>> pprint(left)
[('alpha', 0.0, -10.0),
('anglit', -0.78539816339744828, -10.0),
('arcsine', 0.0, -10.0),
('beta', 0.0, -10.0),
('betaprime', 0.0, -10.0),
('burr', 0.0, -10.0),
('chi', 0.0, -10.0),
('chi2', 0.0, -10.0),
('cosine', -3.1415926535897931, -10.0),
('erlang', 0.0, -10.0),
('expon', 0.0, 0),
('exponpow', 0.0, -10.0),
('exponweib', 0.0, -10.0),
('f', 0.0, -10.0),
('fatiguelife', 0.0, -10.0),
('fisk', 0.0, -10.0),
('foldcauchy', 0.0, -10.0),
('foldnorm', 0.0, -10.0),
('frechet_r', 0.0, -10.0),
('gamma', 0.0, -10.0),
('gausshyper', 0.0, -10.0),
('genexpon', 0.0, -10.0),
('gengamma', 0.0, -10.0),
('genhalflogistic', 0.0, -10.0),
('genpareto', 0.0, -10.0),
('gilbrat', 0.0, -10.0),
('gompertz', 0.0, -10.0),
('halfcauchy', 0.0, -10.0),
('halflogistic', 0.0, -10.0),
('halfnorm', 0.0, -10.0),
('invgamma', 0.0, -10.0),
('invgauss', 0.0, -10.0),
('invnorm', 0.0, -10.0),
('invweibull', 0, -10.0),
('johnsonb', 0.0, -10.0),
('ksone', 0.0, -10.0),
('kstwobign', 0.0, -10.0),
('levy', 0.0, -10.0),
('loglaplace', 0.0, -10.0),
('lognorm', 0.0, -10.0),
('lomax', 0.0, -10.0),
('maxwell', 0.0, -10.0),
('mielke', 0.0, -10.0),
('nakagami', 0.0, -10.0),
('ncf', 0.0, -10.0),
('ncx2', 0.0, -10.0),
('pareto', 1.0, -10.0),
('powerlaw', 0.0, -10.0),
('powerlognorm', 0.0, -10.0),
('rayleigh', 0.0, -10.0),
('rdist', -1.0, -10.0),
('recipinvgauss', 0.0, -10.0),
('rice', 0.0, -10.0),
('semicircular', -1.0, -10.0),
('triang', 0.0, -10.0),
('truncexpon', 0.0, -10.0),
('uniform', 0.0, -10.0),
('wald', 0.0, -10.0),
('weibull_min', 0.0, -10.0),
('wrapcauchy', 0.0, -10.0)]

>>> right = [(d.name, d.b, d.xb) for d in dist_cont if not np.isposinf(d.b)]
>>> pprint(right)
[('anglit', 0.78539816339744828, 10.0),
('arcsine', 1.0, 10.0),
('beta', 1.0, 10.0),
('betaprime', 500.0, 10.0),
('cosine', 3.1415926535897931, 10.0),
('frechet_l', 0.0, 10.0),
('gausshyper', 1.0, 10.0),
('johnsonb', 1.0, 10.0),
('levy_l', 0.0, 10.0),
('powerlaw', 1.0, 10.0),
('rdist', 1.0, 10.0),
('semicircular', 1.0, 10.0),
('triang', 1.0, 10.0),
('uniform', 1.0, 10.0),
('weibull_max', 0.0, 10.0),
('wrapcauchy', 6.2831853071795862, 10.0)]

only pareto has both limits on the same side of zero

>>> pprint ([(d.name, d.a, d.b) for d in dist_cont if d.a*d.b>0])
[('pareto', 1.0, inf)]

genextreme, and maybe one or two others, are missing because finite a,
b are set in _argcheck
vonmises is for circular and doesn't behave properly

only two distributions define non-generic xa or xb

>>> pprint ([(d.name, d.a, d.b, d.xa, d.xb) for d in dist_cont if not d.xa*d.xb==-100])
[('foldcauchy', 0.0, inf, -10.0, 1000), ('recipinvgauss', 0.0, inf, -10.0, 50)]

a pull request setting correct xa, xb would be very welcome

Josef

>
> Josef
>
>>
>> Nicky
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
`