[SciPy-User] Help with clustering : Memory Error with large dataset
Emanuele Olivetti
emanuele@relativita....
Sat Apr 7 14:21:40 CDT 2012
Hi,
You might want to have a look at scikit-learn:
http://scikit-learn.org
In particular, see the clustering documentation:
http://scikit-learn.org/stable/modules/clustering.html
Using a connectivity matrix (e.g. built through a
BallTree) might solve the memory issue, since the
clustering then only considers neighbouring points
rather than all pairs. See for example:
http://scikit-learn.org/stable/auto_examples/cluster/plot_ward_structured_vs_unstructured.html#example-cluster-plot-ward-structured-vs-unstructured-py
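A minimal sketch of the connectivity idea, assuming a recent scikit-learn in which connectivity-constrained Ward clustering is exposed as AgglomerativeClustering (the class had a different name in 2012-era releases). The synthetic points, n_neighbors=10 and n_clusters=2 are illustrative choices only, not values from the original problem:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Hypothetical stand-in for the real (x, y) coordinates: two separated blobs.
rng = np.random.RandomState(0)
ward_points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(500, 2)),
    rng.normal(loc=10.0, scale=0.3, size=(500, 2)),
])

# Sparse k-nearest-neighbour graph: about n * k stored entries instead of the
# n * (n - 1) / 2 distances that pdist would allocate.
connectivity = kneighbors_graph(ward_points, n_neighbors=10, include_self=False)

# Ward clustering that only merges clusters linked in the graph.
# scikit-learn may warn that the graph has several connected components
# and complete it automatically.
ward = AgglomerativeClustering(
    n_clusters=2, linkage="ward", connectivity=connectivity
)
ward_labels = ward.fit_predict(ward_points)
print(np.bincount(ward_labels))
```

Unlike DBSCAN, this needs the number of clusters up front and has no notion of noise points, so it is a substitute rather than an equivalent.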
Another possibility could be their implementation
of k-means, which does not need the full pairwise
distance matrix.
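One possible reading of this suggestion, and it is only an assumption about which implementation was meant, is scikit-learn's MiniBatchKMeans, which updates the centroids from small random batches. A rough sketch with made-up data; n_clusters=2 and batch_size=1024 are placeholders that would need tuning for real coordinates:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical stand-in for the real (x, y) coordinates.
rng = np.random.RandomState(0)
kmeans_points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(500, 2)),
    rng.normal(loc=10.0, scale=0.3, size=(500, 2)),
])

# Each update only looks at one batch of points, so memory use is driven
# by the input array and the batch size, not by the number of point pairs.
kmeans = MiniBatchKMeans(n_clusters=2, batch_size=1024, n_init=3, random_state=0)
kmeans_labels = kmeans.fit_predict(kmeans_points)
print(kmeans.cluster_centers_)
```

Note that k-means needs the number of clusters in advance and assigns every point to a cluster, so it will not flag noise the way DBSCAN does.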
Best,
Emanuele
On 04/05/2012 11:10 PM, Abhishek Pratap wrote:
> Hey Guys
>
> I am re-posting a message I had sent to the numpy mailing list earlier. In
> summary, I need help with clustering. My input dataset is about 1-2
> million x,y coordinates, which I would like to cluster together, for example
> using the DBSCAN algorithm. I tried it on a small data set and it works fine.
> When I increase my input size it crashes. Can I be more efficient?
> More details copied below.
>
> Thanks!
> -Abhi
>
>
> ===message from numpy mailing list====
>
>
> I am new to Python, and even more so to numpy. I am trying to cluster
> close to 900K points using the DBSCAN algorithm. My input is a list of ~900k
> tuples, each holding the (x,y) coordinates of one point. I am converting them
> to a numpy array and passing it to the pdist function of
> scipy.spatial.distance to calculate the distance between each pair of points.
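
The pdist step described here materialises every pairwise distance before any clustering happens. As a rough sketch of an alternative, assuming a recent scikit-learn release, its DBSCAN class accepts the coordinate array directly and finds neighbours with a tree-based radius query. The helper name, the demo grid, and the eps and min_samples values are hypothetical:

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_points_dbscan(points, eps, min_samples):
    """Return one cluster label per (x, y) point; -1 marks noise."""
    points = np.asarray(points, dtype=np.float64)
    # algorithm="ball_tree" asks for tree-based neighbour queries, so no
    # full n x n distance matrix is built.
    model = DBSCAN(eps=eps, min_samples=min_samples, algorithm="ball_tree")
    return model.fit_predict(points)


# Hypothetical demo data: two 10 x 10 grids with 0.1 spacing, far apart.
axis = np.arange(10) * 0.1
grid = np.array([(x, y) for x in axis for y in axis])
demo_points = np.vstack([grid, grid + 100.0])

demo_labels = cluster_points_dbscan(demo_points, eps=0.15, min_samples=3)
n_clusters = len(set(demo_labels.tolist()) - {-1})
print(n_clusters)
```

Memory use still grows with the number of neighbours that fall within eps of each point, so a very large eps on dense data can be expensive too.
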
>
> Here is some size info on my numpy array
> shape of input array : (828575, 2)
> Size : 6872000 bytes
>
> I think the error has something to do with the default double dtype of
> the array that the pdist function allocates. I would appreciate it if you
> could help me debug this. I am sure I am overlooking something naive here.
>
> See the traceback below.
>
>
> MemoryError                               Traceback (most recent call last)
> /house/homedirs/a/apratap/Dropbox/dev/ipython/<ipython-input-83-ee29361b7276> in <module>()
>      36
>      37 print cleaned_senseBam
> ---> 38 cluster_pet_points_per_chromosome(sense_bamFile)
>
> /house/homedirs/a/apratap/Dropbox/dev/ipython/<ipython-input-83-ee29361b7276> in cluster_pet_points_per_chromosome(bamFile)
>      30     print 'Size of list points is %d' % sys.getsizeof(points)
>      31     print 'Size of numpy array is %d' % sys.getsizeof(points_array)
> ---> 32     cluster_points_DBSCAN(points_array)
>      33     #print points_array
>      34
>
> /house/homedirs/a/apratap/Dropbox/dev/ipython/<ipython-input-72-77005d7cd900> in cluster_points_DBSCAN(data_numpy_array)
>       9 def cluster_points_DBSCAN(data_numpy_array):
>      10     #eucledian distance calculation
> ---> 11     D = distance.pdist(data_numpy_array)
>      12     S = distance.squareform(D)
>      13     H = 1 - S/np.max(S)
>
> /house/homedirs/a/apratap/playground/software/epd-7.2-2-rh5-x86_64/lib/python2.7/site-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
>    1155
>    1156     m, n = s
> -> 1157     dm = np.zeros((m * (m - 1) / 2,), dtype=np.double)
>    1158
>    1159     wmink_names = ['wminkowski', 'wmi', 'wm', 'wpnorm']
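
For the array shape quoted above, (828575, 2), the size of the allocation on line 1157 of the traceback can be worked out directly. The short calculation below is only arithmetic on the m * (m - 1) / 2 formula shown there; the helper name is made up for illustration:

```python
def condensed_distance_bytes(n_points, itemsize=8):
    """Bytes needed for the m * (m - 1) / 2 distances that pdist allocates."""
    n_pairs = n_points * (n_points - 1) // 2
    return n_pairs * itemsize


m = 828575  # number of points, from the reported array shape

pdist_bytes = condensed_distance_bytes(m)   # 8-byte doubles, as in np.double
square_bytes = m * m * 8                    # the full matrix from squareform

print("pdist result: %.2f TB" % (pdist_bytes / 1e12))       # 2.75 TB
print("squareform result: %.2f TB" % (square_bytes / 1e12))  # 5.49 TB
```

So the dtype is not really the problem: even 4-byte floats would need about 1.37 TB for the condensed distances alone. It is the quadratic number of point pairs that has to be avoided.
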