[Numpy-discussion] Generating random samples without repeats
Fri Sep 19 04:40:04 CDT 2008
2008/9/19 Paul Moore <email@example.com>:
> Robert Kern <robert.kern <at> gmail.com> writes:
>> On Thu, Sep 18, 2008 at 16:55, Paul Moore <pf_moore <at> yahoo.co.uk> wrote:
>> > I want to generate a series of random samples, to do simulations based
>> > on them. Essentially, I want to be able to produce a SAMPLESIZE * N
>> > matrix, where each row of N values consists of either
>> > 1. Integers between 1 and M (simulating M rolls of an N-sided die), or
>> > 2. A sample of N numbers between 1 and M without repeats (simulating
>> > deals of N cards from an M-card deck).
>> > Example (1) is easy, numpy.random.random_integers(1, M, (SAMPLESIZE, N))
>> > But I can't find an obvious equivalent for (2). Am I missing something
>> > glaringly obvious? I'm using numpy - is there maybe something in scipy I
>> > should be looking at?
>> numpy.array([(numpy.random.permutation(M) + 1)[:N]
>> for i in range(SAMPLESIZE)])
> And yet, this takes over 70s and peaks at around 400M memory use, whereas the
> equivalent for (1)
> takes less than half a second, and negligible working memory (both end up
> allocating an array of the same size, but your suggestion consumes temporary
> working memory - I suspect, but can't prove, that the time taken comes from
> memory allocations rather than computation.
> As a one-off cost initialising my data, it's not a disaster, but I anticipate
> using idioms like this later in my calculations as well, where the costs could
> hurt more.
> If I'm going to need to write C code, are there any good examples of this? (I
> guess the source for numpy.random is a good place to start).
This was discussed on one of the mailing lists several months ago. It
turns out that there is no simple way to efficiently choose without
replacement in numpy/scipy. I posted a hack that does this somewhat
efficiently (if SAMPLESIZE>M/2, choose the first SAMPLESIZE of a
permutation; if SAMPLESIZE<M/2, choose with replacement and redraw any
duplicates) but it's not vectorized across many sample sets. Is your
problem large M or large N? what is SAMPLESIZE/M?
More information about the Numpy-discussion