[SciPy-Dev] stats, distributions, design choices
Thu Jun 13 18:11:00 CDT 2013
On Thu, Jun 13, 2013 at 5:42 PM, Evgeni Burovski
> On Thu, Jun 13, 2013 at 10:02 PM, <email@example.com> wrote:
>> On Thu, Jun 13, 2013 at 4:46 PM, Evgeni Burovski
>> <firstname.lastname@example.org> wrote:
>> > Looking into the source of stats.distributions, I'm a bit puzzled by the
>> > way
>> > incorrect distribution parameters are handled. Most likely, I'm missing
>> > something very simple, so I'd appreciate if someone knowledgeable can
>> > comment on these:
>> > 1. For incorrect arguments of a distribution, rvs() raises a ValueError,
>> > but
>> > pdf(), pmf() and their relatives return a magic badarg value:
>> >>>> from scipy.stats import norm
>> >>>> norm._argcheck(0, -1)
>> > False
>> >>>> norm.pdf(1, loc=0, scale=-1)
>> > nan
>> >>>> norm.rvs(loc=0, scale=-1)
>> > Traceback (most recent call last):
>> > File "<stdin>", line 1, in <module>
>> > File
>> > "/home/br/virtualenvs/scipy-dev/local/lib/python2.7/site-packages/scipy/stats/distributions.py",
>> > line 617, in rvs
>> > raise ValueError("Domain error in arguments.")
>> > ValueError: Domain error in arguments.
>> > Is there a rationale behind this? I'd naively expect a pdf to raise an
>> > error as well --- or is there a use case where the current behavior is
>> > preferable?
>> The same reason we return nans instead of raising an exception in
>> other places.
>> When we calculate vectorized results, we still return the valid
>> results, with nan at the invalid positions.
>> If there is only a scalar answer, then we raise an exception if inputs
>> are invalid.
> When, say, trying to evaluate a pdf outside of a distribution's support, yes,
> I understand. But what about the case where there's no chance of getting a
> meaningful answer: say, trying to use a normal distribution with sigma=-1,
> like in the example I mentioned.
> Vectorization... hmmm, what would be a use case for something like this:
>>>> from scipy.stats import poisson
>>>> poisson.pmf(2, [1, -1])
> array([ 0.18393972, nan])
I don't know of many cases; I usually try to avoid wrong parameters.
However, zero (which could be a floating-point negative like -1e-30)
might show up as a special case in some calculations:
>>> stats.poisson.pmf(2, [0, 1])
array([ nan, 0.18393972])
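For what it's worth, the continuous distributions behave the same way under
vectorized parameters; a minimal sketch (the second scale below is invalid, so
only that entry comes back nan):

```python
import numpy as np
from scipy.stats import norm

# One valid and one invalid scale in a single vectorized call;
# the valid entry is computed normally, the invalid one becomes nan
# instead of the whole call raising.
scales = np.array([1.0, -1.0])
vals = norm.pdf(1.0, loc=0.0, scale=scales)
print(vals)  # first entry is the standard normal pdf at 1, second is nan
```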
The only case I remember right now was that I wanted nan propagation
for the t distribution for t-tests at one point.
In general, it's just the pattern that we follow: why else do we have
nans and masked arrays?
If we have lots of samples, then there might be some cases that don't
satisfy the parameter restrictions, but we still want all the other
results. For the t-test, that case is a zero variance, i.e. an array of
only constant values.
>>> stats.t.sf(0, 5, scale=0)
>>> stats.t.sf(0, 0)
>>> stats.t.sf(np.nan, 10) # this didn't propagate nan in scipy 0.9
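A quick self-contained check of the three calls above; in current scipy all of
them should return nan rather than raise:

```python
import numpy as np
from scipy import stats

# scale=0 and df=0 fail the parameter checks, and a nan input propagates;
# in all three cases the result is nan instead of an exception.
print(stats.t.sf(0, 5, scale=0))
print(stats.t.sf(0, 0))
print(stats.t.sf(np.nan, 10))
```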
Other possible use cases:
optimization, when the optimizer tries invalid parameter values (but I
think this is now handled explicitly).
I find it easier to debug if I can see from the result which values
are valid and which are nan, especially when it's a larger array.
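That debugging pattern might look like this (a sketch with made-up parameter
values; index 2 holds the invalid one):

```python
import numpy as np
from scipy import stats

# A batch of scale parameters where one entry is invalid:
scales = np.array([1.0, 2.0, -1.0, 0.5])
vals = stats.norm.pdf(0.0, scale=scales)

# nan entries in the result point directly at the offending parameters
bad = np.flatnonzero(np.isnan(vals))
print(bad)  # -> [2]
```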
> while rvs() throws anyway:
> line 7472, in _rvs
> Pk = k*log(mu)-gamln(k+1) - mu
> File "mtrand.pyx", line 3681, in mtrand.RandomState.poisson
> ValueError: lam < 0
> The fix is trivial, and I can turn this into a pull request, if that's a
> more appropriate medium for this discussion.
No pull request; this mailing list is the right place for design
discussions like this.
The pull request would not be merged without a previous discussion on
the mailing list.