[Numpy-discussion] Alternative to R phyper

Robert Kern robert.kern@gmail....
Mon Apr 30 06:27:26 CDT 2012


On Mon, Apr 30, 2012 at 12:09, Bruno Santos <bacmsantos@gmail.com> wrote:
> Hello everyone,
>
> I have a bit of code where I am using rpy2 to import R phyper so I can
> perform an hypergeometric test. Unfortunately our cluster does not have a
> functional installation of rpy2 working. So I am wondering if I could
> translate to scipy which would make the code completly independent of R. The
> python code I have is as following:
>
>
> def lphyper(self,q,m,n,k):
>         """
>         self.phyper(self,q,m,n,k)
>         Calculate p-value using R function phyper from rpy2 low-level
>         interface.
>         "R Documentation
>         phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
>         q: vector of quantiles representing the number of white balls
>             drawn without replacement from an urn which contains both
>             black and white balls.
>         m: the number of white balls in the urn.
>         n: the number of black balls in the urn.
>         k: the number of balls drawn from the urn.
>         log.p: logical; if TRUE, probabilities p are given as log(p).
>         lower.tail: logical; if TRUE (default), probabilities are P[X <= x],
>             otherwise, P[X > x].
>         """
>         phyper_q = SexpVector([q,], rinterface.INTSXP)
>         phyper_m = SexpVector([m,], rinterface.INTSXP)
>         phyper_n = SexpVector([n,], rinterface.INTSXP)
>         phyper_k = SexpVector([k,], rinterface.INTSXP)
>         return phyper(phyper_q,phyper_m,phyper_n,phyper_k,**myparams)[0]
>
> I have looked to scipy.stats.hypergeom but it is giving me a different
> result which is also negative.
>> 1-phyper(45, 92, 7518, 1329)
> [1] 6.92113e-13
>
> In [24]: stats.hypergeom.sf(45,(92+7518),92,1329)
> Out[24]: -8.4343643180773142e-12

The CDF (CMF? whatever) for stats.hypergeom is not implemented
explicitly, so it falls back to the default implementation of just
summing up the PMF. You are falling victim to floating point error
accumulation such that the sum exceeds 1.0, so the survival function
1-CDF ends up negative. I don't think we have a better implementation
of the CDF anywhere in scipy, sorry.

-- 
Robert Kern


More information about the NumPy-Discussion mailing list