[Numpy-discussion] indexing, searchsorting, ...

Jan Strube curiousjan@gmail....
Mon Jan 25 15:38:48 CST 2010


Dear List,

I'm trying to speed up a piece of code that selects a subsample based on some criteria:
Setup:
I have two samples, raw and cut. cut is a pure subset of raw: every element of cut is also in raw, because cut is derived from raw by applying some cuts.
Now I would like to select a random subsample of raw and find out how many of its elements are also in cut. In other words, some of those random events pass the cuts, others don't.
So in principle I have:

randomSample = np.random.randint(0, len(raw), size=sampleSize)
random_that_pass1 = [r for r in raw[randomSample] if r in cut]

This is fine (I hope), but slow.
I have seen searchsorted mentioned as a possible way to speed this up.
Now it gets complicated. I'm creating a boolean array that contains True wherever a raw event is in cut.

raw_sorted = np.sort(raw)
cut_sorted = np.sort(cut)
passed = np.searchsorted(raw_sorted, cut_sorted)
raw_bool = np.zeros(len(raw), dtype='bool')
raw_bool[passed] = True

Now I create a second boolean array that is set to True at the randomly selected indices. The events I care about are the ones that both pass the cuts and are picked by the random selection:

sample_bool = np.zeros(len(raw), dtype='bool')
sample_bool[randomSample] = True
random_that_pass2 = raw[np.logical_and(raw_bool, sample_bool)]

The problem comes in now:
random_that_pass2 and random_that_pass1 have different lengths!
Sometimes one is longer, sometimes the other. I am completely at a loss to explain this.
I tend to trust the slow selection that produces random_that_pass1, because it's only two lines, but I don't understand where the other selection could fail.

Unfortunately, the samples that give me trouble are 2.2 MB, so maybe a bit large to mail around, but I can put it somewhere if needed.
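In lieu of the full data, here is a minimal self-contained sketch of the two selections side by side. Everything about the data is an assumption made for illustration: raw is stood in for by unique integer event IDs in shuffled order, the "cut" is an arbitrary made-up condition, and the names rng, sample_size, pass1 and pass2 are hypothetical. It is not my real sample, only a small case where the two lengths can be compared directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real samples: unique integer event IDs, unsorted.
raw = rng.permutation(1000)
cut = raw[raw % 3 == 0]            # made-up cut that keeps a subset of raw
sample_size = 200

# Random indices into raw, drawn with replacement.
random_sample = rng.integers(0, len(raw), size=sample_size)

# Selection 1: slow membership test, one entry per drawn index.
pass1 = [r for r in raw[random_sample] if r in cut]

# Selection 2: boolean masks built via searchsorted.
raw_sorted = np.sort(raw)
cut_sorted = np.sort(cut)
passed = np.searchsorted(raw_sorted, cut_sorted)   # positions within raw_sorted
raw_bool = np.zeros(len(raw), dtype=bool)
raw_bool[passed] = True
sample_bool = np.zeros(len(raw), dtype=bool)
sample_bool[random_sample] = True
pass2 = raw[np.logical_and(raw_bool, sample_bool)]

# Compare the lengths of the two selections.
print(len(pass1), len(pass2))
```

Changing the seed passed to default_rng gives a different draw, which makes it easy to see how the two lengths vary from run to run.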
Thank you for your help,
Cheers,
    Jan


