[Numpy-discussion] Generating random samples without repeats
Fri Sep 19 14:19:09 CDT 2008
2008/9/19 Paul Moore <firstname.lastname@example.org>:
> Anne Archibald <peridot.faceted <at> gmail.com> writes:
>> This was discussed on one of the mailing lists several months ago. It
>> turns out that there is no simple way to efficiently choose without
>> replacement in numpy/scipy.
> That reassures me that I'm not missing something obvious! I'm pretty new with
> numpy (I've lurked here for a number of years, but never had a real-life need
> to use numpy until now).
>> I posted a hack that does this somewhat
>> efficiently (if SAMPLESIZE>M/2, choose the first SAMPLESIZE of a
>> permutation; if SAMPLESIZE<M/2, choose with replacement and redraw any
>> duplicates) but it's not vectorized across many sample sets. Is your
>> problem large M or large N? what is SAMPLESIZE/M?
> It's actually large SAMPLESIZE. As an example, I'm simulating repeated deals
> of poker hands from a deck of cards: M=52, N=5, SAMPLESIZE=1000000.
Er. I must have got my variable names wrong. Anyway, since you are
choosing 5 from 52, repeats should be quite rare, and generating a
(52,1000000) temporary array is expensive. So I would generate a
(5,1000000) array with replacement, then use sort and diff (or unique)
to look for any duplicates along the short axis; then fill those in
with random values drawn with replacement. Repeat until there aren't
any more duplicates. This is roughly the approach my code took, but my
code did it for a single draw, and iterating it a million times will
be expensive. If you redraw for the whole large array at once, it
should be fairly cheap. The only cost will be re-sorting and
re-diffing all those already-sorted hands that didn't need
replacement. Even there you should be able to speed things with
judicious use of fancy indexing.
P.S. here's an implementation, seems to work all right:
"""Choose n nonnegative integers less than m without replacement
Returns an array of shape n, or (n,repeats).
if repeats is None:
r = 1
r = repeats
raise ValueError, "Cannot find %d nonnegative integers less
than %d" % (n,m)
res = np.sort(np.random.rand(m,r).argsort(axis=0)[:n,:],axis=0)
res = np.random.random_integers(m,size=(n,r))
res = np.sort(res,axis=0)
w = np.nonzero(np.diff(res,axis=0)==0)
nr = len(w)
res[w] = np.random.random_integers(m,size=nr)
if repeats is None:
For really large values of repeats it does too much sorting; I didn't
have the energy to make it pull all the ones with repeats to the
beginning so that only they need to be re-sorted the next time
through. Still, the expected number of trips through the while loop
grows only logarithmically with repeats, so it shouldn't be too bad.
More information about the Numpy-discussion