[Numpy-discussion] Multivariate hypergeometric distribution?
josef.pktd@gmai...
josef.pktd@gmai...
Mon Jul 2 20:35:14 CDT 2012
On Mon, Jul 2, 2012 at 8:08 PM, <josef.pktd@gmail.com> wrote:
> On Mon, Jul 2, 2012 at 4:16 PM, Fernando Perez <fperez.net@gmail.com> wrote:
>> Hi all,
>>
>> in recent work with a colleague, the need came up for a multivariate
>> hypergeometric sampler; I had a look in the numpy code and saw we have
>> the bivariate version, but not the multivariate one.
>>
>> I had a look at the code in scipy.stats.distributions, and it doesn't
>> look too difficult to add a proper multivariate hypergeometric by
>> extending the bivariate code, with one important caveat: the hard part
>> is the implementation of the actual discrete hypergeometric sampler,
>> which lives inside of numpy/random/mtrand/distributions.c:
>>
>> https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/distributions.c#L743
>>
>> That code is hand-written C, and it only works for the bivariate case
>> right now. It doesn't look terribly difficult to extend, but it will
>> certainly take a bit of care and testing to ensure all edge cases are
>> handled correctly.
>
> My only foray into this
>
> http://projects.scipy.org/numpy/ticket/921
> http://projects.scipy.org/numpy/ticket/923
>
> This looks difficult to add without a good reference and clear
> description of the algorithm.
>
>>
>> Does anyone happen to have that implemented lying around, in a form
>> that would be easy to merge to add this capability to numpy?
>
> not me, I have never even heard of multivariate hypergeometric distribution.
>
>
> maybe http://hal.inria.fr/docs/00/11/00/56/PDF/perm.pdf p.11
> with some properties http://www.math.uah.edu/stat/urn/MultiHypergeometric.html
>
> I've seen one other algorithm, that seems to need N (number of draws
> in hypergeom) random variables for one multivariate hypergeometric
> random draw, which seems slow to me.
>
> But maybe someone has it lying around.
Now I have a pure num/sci/python version around.
A bit more than an hour, so no guarantees, but freq and pmf look close enough.
Josef
>
> Josef
>
>>
>> Thanks,
>>
>> f
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
-------------- next part --------------
# -*- coding: utf-8 -*-
"""
Created on Mon Jul 02 20:23:08 2012
Author: Josef Perktold
"""
import numpy as np
n_balls = [5, 3, 4]
n_sample = 5
size = 100000
p = len(n_balls)
n_all = sum(n_balls)
rvs = np.zeros((size, p), int)
n_bad = n_all
n_remain = n_sample * np.ones(size, int)
for ii in range(p-1):
n_good = n_balls[ii]
n_bad = n_bad - n_good
rvs_ii = rvs[:,ii]
mask = n_remain >= 1
need = mask.sum()
rvs_ii[mask] = np.random.hypergeometric(n_good, n_bad, n_remain[mask], size=need)
rvs[:,ii] = rvs_ii
n_remain = np.maximum(n_remain - rvs_ii, 0)
rvs[:, -1] = n_sample - rvs.sum(1)
#print rvs
print rvs.mean(0) * n_all / n_sample
u, idx = np.unique(rvs.view([('', int)]*3), return_inverse=True)
u_arr = u.view(int).reshape(len(u), -1)
count = np.bincount(idx)
freq = count * 1. / len(idx)
from scipy.misc import comb
def pmf(x, n_balls):
x = np.asarray(x)
n_balls = np.asarray(n_balls)
#p = len(n_balls)
ret = np.product(comb(n_balls, x)) / comb(n_balls.sum(), x.sum())
return ret
print
print freq
for x,fr in zip(u_arr, freq):
th = pmf(x, n_balls)
print x, np.round(th, 5), fr, np.round(fr - th, 10)
More information about the NumPy-Discussion
mailing list