[Numpy-discussion] Overlapping ranges

josef.pktd@gmai...
Mon Mar 16 17:31:29 CDT 2009


On Mon, Mar 16, 2009 at 5:29 PM, Robert Kern <robert.kern@gmail.com> wrote:
> 2009/3/16 Peter Saffrey <pzs@dcs.gla.ac.uk>:
>
>> At the moment, I'm using a fairly naive approach that finds roughly where in the
>> genome (which gene) each point might be and then checks it against the
>> bins in that gene. If I split the problem into chromosomes, I feel sure
>> there must be some super-fast matrix approach I can apply using numpy, but
>> I'm struggling a bit. Can anybody suggest something?
>
> You probably need something algorithmically better, like interval
> trees. There are a couple of C/Python implementations floating around.
>

If I understand your problem correctly, then for a smaller-scale
problem something like this should work:
{{{
import numpy as np

B = np.array([[1, 3], [2, 5], [7, 10], [6, 15], [14, 20]])  # bins: [start, end]
P = np.c_[np.arange(1, 16), 4 + np.arange(1, 16)]  # point intervals: [start, end]

# A bin is discarded for a point interval if the bin ended before the interval
# started, or if the bin started after the interval ended; all others overlap.
# Equivalent form:
# mask = (~(P[:,0:1] > B[:,1:2].T)) * (~(P[:,1:2] < B[:,0:1].T))
mask = ~np.logical_or(P[:, 0:1] > B[:, 1:2].T, P[:, 1:2] < B[:, 0:1].T)
# 1-based bin numbers where a point interval overlaps a bin, 0 elsewhere
indices = mask * np.arange(1, len(B) + 1)
print(B)
print(P)
print(mask)
print(indices)
}}}

However, it creates a result matrix of shape (number of points) x
(number of bins). If this doesn't fit into memory, some looping
is necessary.
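
One way the looping could be done is to process the points in chunks, so that only a (chunk size) x (number of bins) mask exists at any one time. The following is a minimal sketch under that assumption, not a tuned implementation; the function name `overlaps_chunked` and the `chunk` parameter are made up for illustration and are not part of numpy.

```python
import numpy as np

def overlaps_chunked(P, B, chunk=1000):
    """Return an array of (point_index, bin_index) pairs whose intervals overlap.

    P and B are (n, 2) arrays of [start, end] rows. Points are processed
    `chunk` rows at a time to limit the size of the temporary mask.
    """
    pairs = []
    for start in range(0, len(P), chunk):
        Pc = P[start:start + chunk]
        # Overlap: point starts no later than the bin ends,
        # and point ends no earlier than the bin starts.
        mask = (Pc[:, 0:1] <= B[:, 1]) & (Pc[:, 1:2] >= B[:, 0])
        pi, bi = np.nonzero(mask)
        # Shift chunk-local point indices back to global indices.
        pairs.append(np.column_stack((pi + start, bi)))
    if not pairs:
        return np.empty((0, 2), dtype=int)
    return np.vstack(pairs)

B = np.array([[1, 3], [2, 5], [7, 10], [6, 15], [14, 20]])  # bins
P = np.c_[np.arange(1, 16), 4 + np.arange(1, 16)]           # points
pairs = overlaps_chunked(P, B, chunk=4)
print(pairs)
```

Each row of `pairs` holds 0-based indices into `P` and `B`. The chunk size trades memory for the number of Python-level loop iterations, so it would need tuning for the real data.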

Tested on example only.

Josef

