I think what you are asking is more of a research question than a
SciPy/NumPy question.
You have to think about the problem to see how you can reduce the
amount of data (sampling, streaming, multi-stage clustering,
min-hashing, space-filling curves, etc.) and consider the space
complexity of whichever algorithm suits it. At this scale, plug and
play rarely works; some work to reformulate the problem
appropriately is required.
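
As one possible illustration of such a reformulation (a rough sketch, not
necessarily what fits your data): DBSCAN only needs to know which points
lie within eps of each other, so the full distance matrix can be avoided
by binning points into a grid of eps-sized cells and comparing each point
only against the 3x3 block of cells around it. The function name
neighbor_counts and the eps value below are made up for the example; the
code uses only the standard library. scipy.spatial.cKDTree with
query_ball_point should serve the same purpose if you prefer to stay
inside SciPy.

```python
from collections import defaultdict
from math import floor


def neighbor_counts(points, eps):
    """For each point, count the other points within distance eps.

    Hypothetical helper for illustration: memory use grows with the
    number of points, not with the number of point pairs.
    """
    # Bin every point index into a square grid cell of side eps.
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(floor(x / eps), floor(y / eps))].append(idx)

    eps_sq = eps * eps
    counts = [0] * len(points)
    for (cx, cy), members in grid.items():
        # A neighbor within eps must be in this cell or one of the 8 around it.
        candidates = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                candidates.extend(grid.get((cx + dx, cy + dy), ()))
        for i in members:
            xi, yi = points[i]
            for j in candidates:
                if j == i:
                    continue
                xj, yj = points[j]
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps_sq:
                    counts[i] += 1
    return counts


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.5, 0.0), (0.9, 0.1), (10.0, 10.0)]
    print(neighbor_counts(pts, eps=1.0))
```

In DBSCAN terms, a point whose count plus one reaches the minimum
cluster size would be a core point. If the data is very dense within
eps, the per-cell candidate lists get long and sampling first may still
be needed.
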
On Thu, Apr 5, 2012 at 4:10 PM, Abhishek Pratap <apratap@lbl.gov> wrote:
> Hey Guys
>
> I am re-posting a message I had sent to the numpy mailing list earlier.
> In summary, I need help with clustering. My input dataset is about 1-2
> million x,y coordinates which I would like to cluster, for example
> using the DBSCAN algorithm. I tried it on a small data set and it works
> fine. When I increase my input size it crashes. Can I be more efficient?
> More details copied below.
>
> Thanks!
> -Abhi
>
>
> ===message from numpy mailing list====
>
>
> I am new to both python and, more so, to numpy. I am trying to cluster
> close to 900K points using the DBSCAN algorithm. My input is a list of
> ~900k tuples, each holding the (x,y) coordinates of one point. I am
> converting them to a numpy array and passing it to the pdist method of
> scipy.spatial.distance to calculate the distance between each pair of
> points.
>
> Here is some size info on my numpy array
> shape of input array : (828575, 2)
> Size : 6872000 bytes
>
> I think the error has something to do with the default double dtype of
> the numpy array in the pdist function. I would appreciate it if you
> could help me debug this. I am sure I am overlooking something naive here.
>
> See the traceback below.
>
>
> MemoryError Traceback (most recent call last)
> /house/homedirs/a/apratap/Dropbox/dev/ipython/<ipython-input-83-ee29361b7276>
> in <module>()
> 36
> 37 print cleaned_senseBam
> ---> 38 cluster_pet_points_per_chromosome(sense_bamFile)
>
> /house/homedirs/a/apratap/Dropbox/dev/ipython/<ipython-input-83-ee29361b7276>
> in cluster_pet_points_per_chromosome(bamFile)
> 30 print 'Size of list points is %d' % sys.getsizeof(points)
> 31 print 'Size of numpy array is %d' %
> sys.getsizeof(points_array)
> ---> 32 cluster_points_DBSCAN(points_array)
> 33 #print points_array
>
> 34
>
> /house/homedirs/a/apratap/Dropbox/dev/ipython/<ipython-input-72-77005d7cd900>
> in cluster_points_DBSCAN(data_numpy_array)
> 9 def cluster_points_DBSCAN(data_numpy_array):
> 10 #eucledian distance calculation
>
> ---> 11 D = distance.pdist(data_numpy_array)
> 12 S = distance.squareform(D)
> 13 H = 1 - S/np.max(S)
>
> /house/homedirs/a/apratap/playground/software/epd-7.2-2-rh5-x86_64/lib/python2.7/site-packages/scipy/spatial/distance.pyc
> in pdist(X, metric, p, w, V, VI)
> 1155
> 1156 m, n = s
> -> 1157 dm = np.zeros((m * (m - 1) / 2,), dtype=np.double)
> 1158
> 1159 wmink_names = ['wminkowski', 'wmi', 'wm', 'wpnorm']
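
For what it is worth, the line marked in the traceback allocates
m * (m - 1) / 2 doubles, and the arithmetic for your array shape is easy
to check. The sketch below only uses the row count from your message
(828575) and assumes 8 bytes per np.double; the function names are made
up for the example.

```python
def pdist_bytes(m, itemsize=8):
    """Bytes needed for the condensed distance vector that pdist allocates."""
    return m * (m - 1) // 2 * itemsize


def squareform_bytes(m, itemsize=8):
    """Bytes needed for the full m x m matrix that squareform would build."""
    return m * m * itemsize


if __name__ == "__main__":
    m = 828575  # number of rows reported in the original message
    print("pdist:      %.2f TB" % (pdist_bytes(m) / 1e12))
    print("squareform: %.2f TB" % (squareform_bytes(m) / 1e12))
```

That comes to roughly 2.75 TB for pdist alone and about twice that for
the squareform step, so a different dtype would not rescue this
approach; the pairwise matrix itself has to go.
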
