[Numpy-discussion] sample without replacement
Sturla Molden
sturla@molden...
Tue Dec 21 13:33:49 CST 2010
We often need to generate more than one such sample from an array, e.g.
for permutation tests. If we shuffle an array x of size N and use x[:M] as
a random sample "without replacement", we just need to put them back
randomly to get the next sample (cf. Fisher-Yates shuffle). That way we
get O(M) amortized complexity for each sample of size M. Only the first
sample will have complexity O(N).
Sturla
> I know this question came up on the mailing list some time ago
> (19/09/2008), and the conclusion was that yes, you can do it more or
> less efficiently in pure python; the trick is to use two different
> methods. If your sample is more than, say, a quarter the size of the
> set you're drawing from, you permute the set and take the first few.
> If your sample is smaller, you draw with replacement, then redraw the
> duplicates, and repeat until there aren't any more duplicates. Since
> you only do this when your sample is much smaller than the population
> you don't need to repeat many times.
>
> Here's the code I posted to the previous discussion (not tested this
> time around) with comments:
>
> '''
> def choose_without_replacement(m,n,repeats=None):
> """Choose n nonnegative integers less than m without replacement
>
> Returns an array of shape n, or (n,repeats).
> """
> if repeats is None:
> r = 1
> else:
> r = repeats
> if n>m:
> raise ValueError, "Cannot find %d nonnegative integers less
> than %d" % (n,m)
> if n>m/2:
> res = np.sort(np.random.rand(m,r).argsort(axis=0)[:n,:],axis=0)
> else:
> res = np.random.random_integers(m,size=(n,r))
> while True:
> res = np.sort(res,axis=0)
> w = np.nonzero(np.diff(res,axis=0)==0)
> nr = len(w[0])
> if nr==0:
> break
> res[w] = np.random.random_integers(m,size=nr)
>
> if repeats is None:
> return res[:,0]
> else:
> return res
>
> For really large values of repeats it does too much sorting; I didn't
> have the energy to make it pull all the ones with repeats to the
> beginning so that only they need to be re-sorted the next time
> through. Still, the expected number of trips through the while loop
> grows only logarithmically with repeats, so it shouldn't be too bad.
> '''
>
> Anne
>
> On 20 December 2010 12:13, John Salvatier <jsalvati@u.washington.edu>
> wrote:
>> I think this is not possible to do efficiently with just numpy. If you
>> want
>> to do this efficiently, I wrote a no-replacement sampler in Cython some
>> time
>> ago (below). I hearby release it to the public domain.
>>
>> '''
>>
>> Created on Oct 24, 2009
>> http://stackoverflow.com/questions/311703/algorithm-for-sampling-without-replacement
>> @author: johnsalvatier
>>
>> '''
>>
>> from __future__ import division
>>
>> import numpy
>>
>> def random_no_replace(sampleSize, populationSize, numSamples):
>>
>>
>>
>> samples = numpy.zeros((numSamples, sampleSize),dtype=int)
>>
>>
>>
>> # Use Knuth's variable names
>>
>> cdef int n = sampleSize
>>
>> cdef int N = populationSize
>>
>> cdef i = 0
>>
>> cdef int t = 0 # total input records dealt with
>>
>> cdef int m = 0 # number of items selected so far
>>
>> cdef double u
>>
>> while i < numSamples:
>>
>> t = 0
>>
>> m = 0
>>
>> while m < n :
>>
>>
>>
>> u = numpy.random.uniform() # call a uniform(0,1) random
>> number
>> generator
>>
>> if (N - t)*u >= n - m :
>>
>>
>>
>> t += 1
>>
>>
>>
>> else:
>>
>>
>>
>> samples[i,m] = t
>>
>> t += 1
>>
>> m += 1
>>
>>
>>
>> i += 1
>>
>>
>>
>> return samples
>>
>>
>>
>> On Mon, Dec 20, 2010 at 8:28 AM, Alan G Isaac <alan.isaac@gmail.com>
>> wrote:
>>>
>>> I want to sample *without* replacement from a vector
>>> (as with Python's random.sample). I don't see a direct
>>> replacement for this, and I don't want to carry two
>>> PRNG's around. Is the best way something like this?
>>>
>>> permutation(myvector)[:samplesize]
>>>
>>> Thanks,
>>> Alan Isaac
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
More information about the NumPy-Discussion
mailing list