[Numpy-discussion] sample without replacement
Anne Archibald
aarchiba@physics.mcgill...
Tue Dec 21 09:39:01 CST 2010
I know this question came up on the mailing list some time ago
(19/09/2008), and the conclusion was that yes, you can do it more or
less efficiently in pure python; the trick is to use two different
methods. If your sample is more than, say, a quarter the size of the
set you're drawing from, you permute the set and take the first few.
If your sample is smaller, you draw with replacement, then redraw the
duplicates, and repeat until there aren't any more duplicates. Since
you only do this when your sample is much smaller than the population
you don't need to repeat many times.
Here's the code I posted to the previous discussion (not tested this
time around) with comments:
'''
def choose_without_replacement(m,n,repeats=None):
"""Choose n nonnegative integers less than m without replacement
Returns an array of shape n, or (n,repeats).
"""
if repeats is None:
r = 1
else:
r = repeats
if n>m:
raise ValueError, "Cannot find %d nonnegative integers less
than %d" % (n,m)
if n>m/2:
res = np.sort(np.random.rand(m,r).argsort(axis=0)[:n,:],axis=0)
else:
res = np.random.random_integers(m,size=(n,r))
while True:
res = np.sort(res,axis=0)
w = np.nonzero(np.diff(res,axis=0)==0)
nr = len(w[0])
if nr==0:
break
res[w] = np.random.random_integers(m,size=nr)
if repeats is None:
return res[:,0]
else:
return res
For really large values of repeats it does too much sorting; I didn't
have the energy to make it pull all the ones with repeats to the
beginning so that only they need to be re-sorted the next time
through. Still, the expected number of trips through the while loop
grows only logarithmically with repeats, so it shouldn't be too bad.
'''
Anne
On 20 December 2010 12:13, John Salvatier <jsalvati@u.washington.edu> wrote:
> I think this is not possible to do efficiently with just numpy. If you want
> to do this efficiently, I wrote a no-replacement sampler in Cython some time
> ago (below). I hearby release it to the public domain.
>
> '''
>
> Created on Oct 24, 2009
> http://stackoverflow.com/questions/311703/algorithm-for-sampling-without-replacement
> @author: johnsalvatier
>
> '''
>
> from __future__ import division
>
> import numpy
>
> def random_no_replace(sampleSize, populationSize, numSamples):
>
>
>
> samples = numpy.zeros((numSamples, sampleSize),dtype=int)
>
>
>
> # Use Knuth's variable names
>
> cdef int n = sampleSize
>
> cdef int N = populationSize
>
> cdef i = 0
>
> cdef int t = 0 # total input records dealt with
>
> cdef int m = 0 # number of items selected so far
>
> cdef double u
>
> while i < numSamples:
>
> t = 0
>
> m = 0
>
> while m < n :
>
>
>
> u = numpy.random.uniform() # call a uniform(0,1) random number
> generator
>
> if (N - t)*u >= n - m :
>
>
>
> t += 1
>
>
>
> else:
>
>
>
> samples[i,m] = t
>
> t += 1
>
> m += 1
>
>
>
> i += 1
>
>
>
> return samples
>
>
>
> On Mon, Dec 20, 2010 at 8:28 AM, Alan G Isaac <alan.isaac@gmail.com> wrote:
>>
>> I want to sample *without* replacement from a vector
>> (as with Python's random.sample). I don't see a direct
>> replacement for this, and I don't want to carry two
>> PRNG's around. Is the best way something like this?
>>
>> permutation(myvector)[:samplesize]
>>
>> Thanks,
>> Alan Isaac
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
More information about the NumPy-Discussion
mailing list