[Numpy-discussion] Generating random samples without repeats
Fri Sep 19 05:17:51 CDT 2008
Anne Archibald <peridot.faceted <at> gmail.com> writes:
> This was discussed on one of the mailing lists several months ago. It
> turns out that there is no simple way to efficiently choose without
> replacement in numpy/scipy.
That reassures me that I'm not missing something obvious! I'm pretty new to
numpy (I've lurked here for a number of years, but never had a real-life need
to use numpy until now).
> I posted a hack that does this somewhat
> efficiently (if SAMPLESIZE>M/2, choose the first SAMPLESIZE of a
> permutation; if SAMPLESIZE<M/2, choose with replacement and redraw any
> duplicates) but it's not vectorized across many sample sets. Is your
> problem large M or large N? what is SAMPLESIZE/M?
It's actually large SAMPLESIZE. As an example, I'm simulating repeated deals
of poker hands from a deck of cards: M=52, N=5, SAMPLESIZE=1000000.
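
For reference, the hack described in the quote above could be sketched roughly
as follows for a single sample set. This is only an illustration of the idea as
worded in the quote, not the code that was originally posted; the function name
and the use of numpy's newer default_rng generator are assumptions.

```python
import numpy as np

def sample_without_replacement(m, k, rng):
    """Draw k distinct integers from range(m) (hypothetical helper)."""
    if k > m:
        raise ValueError("cannot draw more distinct items than exist")
    if k > m // 2:
        # Large sample: take the first k entries of a full permutation.
        return rng.permutation(m)[:k]
    # Small sample: draw with replacement, then redraw any duplicates.
    sample = rng.integers(0, m, size=k)
    while True:
        # np.unique reports the index of the first occurrence of each value.
        _, first = np.unique(sample, return_index=True)
        if first.size == k:
            return sample
        dup = np.ones(k, dtype=bool)
        dup[first] = False  # keep first occurrences, redraw the rest
        sample[dup] = rng.integers(0, m, size=int(dup.sum()))

rng = np.random.default_rng(0)
hand = sample_without_replacement(52, 5, rng)  # one 5-card hand
```

As the quote says, this handles one sample set per call, so dealing a million
hands this way still means a million trips through Python.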
For now, Robert's approach will work, but it will start blowing up when I want
100 million samples - I don't have the memory to hold all the data (4 bytes
for an int * N=5 * 100000000 = 2GB plus change). So I'll need to allocate
(say) 1 million at a time in a loop and accumulate my results. That's when
70-second costs to allocate start to hurt. (After all, this is just the
setup - I've got my actual calculations to do as well!)
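
A rough sketch of that chunked loop, to make the memory trade-off concrete. It
deals each chunk by sorting one row of random keys per hand, which is a common
vectorized trick and not necessarily Robert's approach. The chunk size, the
function names, the card encoding (suit = card // 13) and the flush count used
as a stand-in for the "actual calculations" are all assumptions.

```python
import numpy as np

M, N = 52, 5  # deck size and hand size from the example above

def deal_chunk(rng, n_deals):
    """Return an (n_deals, N) array of card indices, no repeats within a row."""
    # One random key per card per deal; the N smallest keys pick the hand.
    keys = rng.random((n_deals, M))
    return np.argsort(keys, axis=1)[:, :N]

def count_flushes(total_deals, chunk=100_000, seed=None):
    """Accumulate a statistic chunk by chunk instead of storing every hand."""
    rng = np.random.default_rng(seed)
    flushes = 0
    done = 0
    while done < total_deals:
        n = min(chunk, total_deals - done)
        hands = deal_chunk(rng, n)
        suits = hands // 13  # assumed encoding: 13 consecutive cards per suit
        # A row counts if all five suits match (straight flushes included).
        flushes += int(np.all(suits == suits[:, :1], axis=1).sum())
        done += n
    return flushes
```

Only one chunk is held in memory at a time. With these numbers the key array
costs 52 * 8 bytes per deal, so a chunk of 100,000 deals needs roughly 42MB
while it is being sorted.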
I'll stick with Robert's approach for now, and see if I can knock up something
using Cython once I really need the speed.