[Numpy-discussion] MemoryError : with scipy.spatial.distance

Abhishek Pratap apratap@lbl....
Wed Apr 4 18:17:40 CDT 2012

Hey Guys

I am new to Python and even newer to numpy. I am trying to cluster
close to 900K points using the DBSCAN algorithm. My input is a list of
~900K tuples, each holding the (x, y) coordinates of one point. I
convert the list to a numpy array and pass it to the pdist function of
scipy.spatial.distance to calculate the distance between every pair of
points.

Here is some size information for my numpy array:
shape of input array  : (828575, 2)
Size :  6872000 bytes

I think the error has something to do with the default double dtype of
the array that pdist allocates for its result. I would appreciate it if
you could help me debug this. I am sure I am overlooking something
simple here.

See the traceback below.

MemoryError                               Traceback (most recent call last)
in <module>()
     37 print cleaned_senseBam
---> 38 cluster_pet_points_per_chromosome(sense_bamFile)

in cluster_pet_points_per_chromosome(bamFile)
     30             print 'Size of list points is %d' % sys.getsizeof(points)
     31             print 'Size of numpy array is %d' %
---> 32             cluster_points_DBSCAN(points_array)
     33             #print points_array


in cluster_points_DBSCAN(data_numpy_array)
      9 def cluster_points_DBSCAN(data_numpy_array):
     10     #eucledian distance calculation

---> 11     D = distance.pdist(data_numpy_array)
     12     S = distance.squareform(D)
     13     H = 1 - S/np.max(S)

in pdist(X, metric, p, w, V, VI)
   1156     m, n = s
-> 1157     dm = np.zeros((m * (m - 1) / 2,), dtype=np.double)
   1159     wmink_names = ['wminkowski', 'wmi', 'wm', 'wpnorm']
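
As a rough check on the allocation at line 1157 of the traceback, the sketch below works out how many entries and bytes that np.zeros call asks for. It assumes the array really has the 828575 rows reported above. The helper names (condensed_entries, bytes_needed) are made up for illustration and are not part of scipy. It uses only plain Python integer arithmetic, with // in place of the Python 2 style / seen in the traceback.

```python
def condensed_entries(m):
    """Number of pairwise distances among m points: m * (m - 1) / 2."""
    return m * (m - 1) // 2


def bytes_needed(m, itemsize=8):
    """Bytes for the condensed distance vector; itemsize=8 is np.double."""
    return condensed_entries(m) * itemsize


m = 828575  # number of rows in the input array, from the shape above

n_entries = condensed_entries(m)
double_bytes = bytes_needed(m, itemsize=8)  # float64, what pdist requests
single_bytes = bytes_needed(m, itemsize=4)  # float32, for comparison only
square_bytes = m * m * 8                    # full m x m matrix from squareform

print("entries in condensed vector: %d" % n_entries)
print("float64 condensed vector:    %.2f TB" % (double_bytes / 1e12))
print("float32 condensed vector:    %.2f TB" % (single_bytes / 1e12))
print("float64 square matrix:       %.2f TB" % (square_bytes / 1e12))
```

If this estimate is right, the condensed vector alone needs roughly 2.7 TB as doubles and would still need about 1.4 TB as 4-byte floats, so the dtype by itself would not explain the failure. The quadratic number of pairs looks like the more likely cause.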
