[Numpy-discussion] extracting a random subset of a vector
Chris Barker
Chris.Barker at noaa.gov
Tue Aug 31 10:41:11 CDT 2004
Curzio Basso wrote:
> import numarray as NA
> import numarray.random_array as RA
>
> N = 1000
> M = 100
> full = NA.arange(N)
> subset = full[RA.permutation(N)][:M]
>
> ---------------------------------------------------------
>
> However, it's quite slow (at least with N~40k),
you can speed it up a tiny bit my subsetting the permutation array first:
subset = full[ RA.permutation(N)[:M] ]
> and from the hotshot
> output is looks like it's the indexing, not the permutation, which takes
> time.
not from my tests:
import numarray.random_array as RA
import numarray as NA
import time
N = 1000000
M = 100
full = NA.arange(N)
start = time.clock()
P = RA.permutation(N)
print "permutation took %F seconds"%(time.clock() - start)
start = time.clock()
subset = full[P[:M]]
print "subsetting took %F seconds"%(time.clock() - start)
which results in:
permutation took 1.640000 seconds
subsetting took 0.000000 seconds
so it's the permutation that takes the time, as I suspected. What would
really speed this up is a random_array.non-repeat-randint() function,
written in C. That way you wouldn't have to permute the entire N values,
when you really just need M of them.
Does anyone else think this would be a useful function? I can't imagine
it wouldn't be that hard to write.
If M <<< N, then you could probably write a little function in Python
that called randint, and removed the repeats. If M is only a little
smaller than N, this would be slow.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
NOAA/OR&R/HAZMAT (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov
More information about the Numpy-discussion
mailing list