[Numpy-discussion] sample without replacement

Sturla Molden sturla@molden...
Tue Dec 21 13:33:49 CST 2010


We often need to generate more than one such sample from an array, e.g.
for permutation tests. If we shuffle an array x of size N and use x[:M] as
a random sample "without replacement", we just need to put them back
randomly to get the next sample (cf. Fisher-Yates shuffle). That way we
get O(M) amortized complexity for each sample of size M. Only the first
sample will have complexity O(N).

Sturla


> I know this question came up on the mailing list some time ago
> (19/09/2008), and the conclusion was that yes, you can do it more or
> less efficiently in pure python; the trick is to use two different
> methods. If your sample is more than, say, a quarter the size of the
> set you're drawing from, you permute the set and take the first few.
> If your sample is smaller, you draw with replacement, then redraw the
> duplicates, and repeat until there aren't any more duplicates. Since
> you only do this when your sample is much smaller than the population
> you don't need to repeat many times.
>
> Here's the code I posted to the previous discussion (not tested this
> time around) with comments:
>
> '''
> def choose_without_replacement(m,n,repeats=None):
>    """Choose n nonnegative integers less than m without replacement
>
>    Returns an array of shape n, or (n,repeats).
>    """
>    if repeats is None:
>        r = 1
>    else:
>        r = repeats
>    if n>m:
>        raise ValueError, "Cannot find %d nonnegative integers less
> than %d" %  (n,m)
>    if n>m/2:
>        res = np.sort(np.random.rand(m,r).argsort(axis=0)[:n,:],axis=0)
>    else:
>        res = np.random.random_integers(m,size=(n,r))
>        while True:
>            res = np.sort(res,axis=0)
>            w = np.nonzero(np.diff(res,axis=0)==0)
>            nr = len(w[0])
>            if nr==0:
>                break
>            res[w] = np.random.random_integers(m,size=nr)
>
>    if repeats is None:
>        return res[:,0]
>    else:
>        return res
>
> For really large values of repeats it does too much sorting; I didn't
> have the energy to make it pull all the ones with repeats to the
> beginning so that only they need to be re-sorted the next time
> through. Still, the expected number of trips through the while loop
> grows only logarithmically with repeats, so it shouldn't be too bad.
> '''
>
> Anne
>
> On 20 December 2010 12:13, John Salvatier <jsalvati@u.washington.edu>
> wrote:
>> I think this is not possible to do efficiently with just numpy. If you
>> want
>> to do this efficiently, I wrote a no-replacement sampler in Cython some
>> time
>> ago (below). I hearby release it to the public domain.
>>
>> '''
>>
>> Created on Oct 24, 2009
>> http://stackoverflow.com/questions/311703/algorithm-for-sampling-without-replacement
>> @author: johnsalvatier
>>
>> '''
>>
>> from __future__ import division
>>
>> import numpy
>>
>> def random_no_replace(sampleSize, populationSize, numSamples):
>>
>>
>>
>>     samples  = numpy.zeros((numSamples, sampleSize),dtype=int)
>>
>>
>>
>>     # Use Knuth's variable names
>>
>>     cdef int n = sampleSize
>>
>>     cdef int N = populationSize
>>
>>     cdef i = 0
>>
>>     cdef int t = 0 # total input records dealt with
>>
>>     cdef int m = 0 # number of items selected so far
>>
>>     cdef double u
>>
>>     while i < numSamples:
>>
>>         t = 0
>>
>>         m = 0
>>
>>         while m < n :
>>
>>
>>
>>             u = numpy.random.uniform() # call a uniform(0,1) random
>> number
>> generator
>>
>>             if  (N - t)*u >= n - m :
>>
>>
>>
>>                 t += 1
>>
>>
>>
>>             else:
>>
>>
>>
>>                 samples[i,m] = t
>>
>>                 t += 1
>>
>>                 m += 1
>>
>>
>>
>>         i += 1
>>
>>
>>
>>     return samples
>>
>>
>>
>> On Mon, Dec 20, 2010 at 8:28 AM, Alan G Isaac <alan.isaac@gmail.com>
>> wrote:
>>>
>>> I want to sample *without* replacement from a vector
>>> (as with Python's random.sample).  I don't see a direct
>>> replacement for this, and I don't want to carry two
>>> PRNG's around.  Is the best way something like  this?
>>>
>>>        permutation(myvector)[:samplesize]
>>>
>>> Thanks,
>>> Alan Isaac
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>




More information about the NumPy-Discussion mailing list