[Numpy-discussion] Overlapping ranges
Mon Mar 16 16:22:22 CDT 2009
I'm trying to file a set of data points, defined by genome coordinates, into bins, also based on genome coordinates. Each data point is (chromosome, start, end, point) and each bin is (chromosome, start, end). I have about 140 million points to file into around 100,000 bins. Both are (roughly) evenly distributed over the 24 chromosomes (1-22, X and Y). Genome coordinates are integers and my data points are floats. For each data point, (end - start) is roughly 1000, but the bins are of uneven widths. Bins might also overlap each other - in that case, I need to know all the bins that a point overlaps.
By overlap, I mean the start or end of the data point (or both) is inside the bin or that the point entirely covers the bin.
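As a minimal sketch of that definition, the three cases can be written out literally and compared with the usual two-comparison interval test. This assumes closed intervals with start <= end for both points and bins (the post does not say whether the end coordinates are inclusive), and the function names are made up for illustration.

```python
import numpy as np

def overlaps(p_start, p_end, b_start, b_end):
    """Literal form of the definition: start inside, end inside, or point covers bin.

    Assumes closed intervals [start, end]; works on scalars or NumPy arrays.
    """
    start_inside = (b_start <= p_start) & (p_start <= b_end)
    end_inside = (b_start <= p_end) & (p_end <= b_end)
    covers_bin = (p_start <= b_start) & (b_end <= p_end)
    return start_inside | end_inside | covers_bin

def overlaps_compact(p_start, p_end, b_start, b_end):
    """Equivalent test when start <= end holds for both intervals."""
    return (p_start <= b_end) & (p_end >= b_start)
```

The compact form needs only two comparisons per point/bin pair, which matters when the test is applied to whole arrays at once.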
At the moment, I'm using a fairly naive approach that finds roughly where in the genome (which gene) each point might be and then checks it against the bins in that gene. If I split the problem into chromosomes, I feel sure there must be some super-fast matrix approach I can apply using numpy, but I'm struggling a bit. Can anybody suggest something?
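One possible shape for such a matrix approach, sketched here under assumptions rather than offered as a tuned solution: handle one chromosome at a time, and compare a chunk of points against all of that chromosome's bins with NumPy broadcasting. The function name, the argument layout (four 1-D coordinate arrays) and the chunk size are all hypothetical, and intervals are assumed closed.

```python
import numpy as np

def assign_points_to_bins(p_start, p_end, b_start, b_end, chunk=10_000):
    """Return (point_idx, bin_idx) for every overlapping pair on one chromosome.

    Points are processed in chunks so the boolean matrix stays at
    chunk x n_bins elements instead of n_points x n_bins.
    """
    p_start = np.asarray(p_start)
    p_end = np.asarray(p_end)
    b_start = np.asarray(b_start)
    b_end = np.asarray(b_end)

    point_idx, bin_idx = [], []
    for lo in range(0, len(p_start), chunk):
        # Column vectors of points against row vectors of bins.
        ps = p_start[lo:lo + chunk, None]
        pe = p_end[lo:lo + chunk, None]
        mask = (ps <= b_end[None, :]) & (pe >= b_start[None, :])
        pi, bi = np.nonzero(mask)
        point_idx.append(pi + lo)  # shift back to whole-array indices
        bin_idx.append(bi)

    if not point_idx:  # no points at all
        empty = np.empty(0, dtype=np.intp)
        return empty, empty
    return np.concatenate(point_idx), np.concatenate(bin_idx)

# Tiny made-up example: three points, three bins (two of which overlap).
pi, bi = assign_points_to_bins(
    [100, 1500, 5000], [1100, 2500, 6000],
    [1000, 1800, 9000], [2000, 3000, 9500],
    chunk=2,
)
```

With around 100,000 bins spread over 24 chromosomes there are roughly 4,000 bins per chromosome, so a chunk of 10,000 points gives a boolean matrix of about 40 million elements. A point that overlaps several bins simply appears in several pairs. If a per-bin summary of the float values is wanted, something like `np.bincount(bi, weights=values[pi], minlength=len(b_start))` would sum them per bin, where `values` is a hypothetical array holding the data points.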