[SciPy-User] Off by one bug in Scipy.stats.hypergeom
josef.pktd@gmai...
josef.pktd@gmai...
Sat Jul 31 21:27:30 CDT 2010
On Sat, Jul 31, 2010 at 8:48 PM, Jacob Biesinger
<jake.biesinger@gmail.com> wrote:
> Hi!
> Perhaps I'm using the module incorrectly, but it looks like the x parameter
> in scipy.stats.hypergeom is off by one. Specifically, I think it's
> one-too-high.
> From the wikipedia article
> http://en.wikipedia.org/wiki/Hypergeometric_distribution#Application_and_example (I
> know they could be wrong-- just hear me out on this),
> scipy.stats.hypergeom?
> Hypergeometric distribution
>
> Models drawing objects from a bin.
> M is total number of objects, n is total number of Type I objects.
> RV counts number of Type I objects in N drawn without replacement
> from
> population.
> So translating wikipedia's example...
> Pr(x=4; M=50, n=5, N=10) = (choose(5,4) * choose(50-5, 10-4)) /
> choose(50,10) = .003964583
> Pr(x=5; M=50, n=5, N=10) = (choose(5,5) * choose(50-5, 10-5)) /
> choose(50,10) = .0001189375
> Which you can check with the python code:
> from scipy import comb as chse # "combination" => choose
>
> float((chse(5,4, exact=1) * chse(50-5,10-4, exact=1))) / chse(50,10,exact=1)
> # example one
> 0.0039645830580150654155 # okay!
> float((chse(5,5, exact=1) * chse(50-5,10-5, exact=1))) / chse(50,10,exact=1)
> # example two
> 0.00011893749174045196247 # okay!
> Try example one with scipy.stats.hypergeom:
> # scipy.stats.hypergeom.sf(x, M, n, N)
> scipy.stats.hypergeom.sf(4,50,5,10)
> 0.00011893749169422652 # correct value for x=5, not x=4
> scipy.stats.hypergeom.sf(5,50,5,10)
> -4.6185277824406512e-14 # wrong
> It seems that changing the loc value from =0 (default) to =1 fixes the
> issue...
> scipy.stats.hypergeom.sf(4,50,5,10, loc=1)
> 0.0040835205497095073 # close enough
> scipy.stats.hypergeom.sf(5,50,5,10, loc=1)
> 0.00011893749169422652 # okay!
> Am I using the package wrong?
I don't know why you are using the survival function.
>From some quick examples, hypergeom looks ok:
pmf has identical results to the Wikipedia example you referenced
>>> stats.hypergeom.pmf([4,5],50,5,10)
array([ 0.00396458, 0.00011894])
>>> stats.hypergeom.pmf(4,50,5,10)
0.0039645830580151411
>>> stats.hypergeom.pmf(5,50,5,10)
0.00011893749174045286
consistency of pmf, cdf, sf
>>> stats.hypergeom.pmf(np.arange(5),50,5,10).sum()
0.9998810625082829
>>> stats.hypergeom.cdf(4,50,5,10)
0.9998810625082829
>>> stats.hypergeom.sf(4,50,5,10)
0.00011893749171709711
>>> 1-stats.hypergeom.pmf(np.arange(5),50,5,10).sum()
0.00011893749171709711
I'm always glad when someone verifies the numbers in scipy stats and
appreciate any reports of inconsistencies or bugs. If there are
differences in the parameterization, then we should make sure that
they are sufficiently documented.
Most distributions are internally tested or against numpy.random and
there are very few tests with other packages.
Thanks,
Josef
> --
> Jake Biesinger
> Graduate Student
> Xie Lab, UC Irvine
> (949) 231-7587
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>
More information about the SciPy-User
mailing list