[SciPy-User] Help with clustering : Memory Error with large dataset

Emanuele Olivetti emanuele@relativita....
Sat Apr 7 14:21:40 CDT 2012


Hi,

You might want to have a look at scikit-learn:
http://scikit-learn.org
In particular at:
http://scikit-learn.org/stable/modules/clustering.html

Maybe using a connectivity matrix (e.g. through a
BallTree) could solve the computational issue. See
for example:
http://scikit-learn.org/stable/auto_examples/cluster/plot_ward_structured_vs_unstructured.html#example-cluster-plot-ward-structured-vs-unstructured-py
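
To make the connectivity idea concrete, here is a rough sketch (not the code from the linked example). pdist on 828,575 points asks for m(m-1)/2 doubles, which is about 2.7 TB, while a sparse k-nearest-neighbour graph stores only k entries per point. The snippet assumes a recent scikit-learn, where kneighbors_graph builds the sparse graph and AgglomerativeClustering accepts it through its connectivity argument. The synthetic data, n_neighbors=10 and n_clusters=2 are illustrative choices, not values from this thread.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Synthetic stand-in for the (x, y) coordinates: two well-separated blobs.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(1000, 2)),
    rng.normal(loc=(50.0, 50.0), scale=0.5, size=(1000, 2)),
])

# Sparse k-nearest-neighbour graph: 10 stored entries per point
# instead of a dense all-pairs distance matrix.
connectivity = kneighbors_graph(points, n_neighbors=10, include_self=False)

# Ward clustering that only merges clusters linked by a graph edge.
# The two blobs share no edges, so scikit-learn may warn that it is
# completing the graph before it builds the tree.
model = AgglomerativeClustering(
    n_clusters=2, linkage="ward", connectivity=connectivity
)
labels = model.fit_predict(points)
```

Note that this is Ward clustering rather than DBSCAN, so the number of clusters has to be chosen in advance.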

Another possibility could be their clever implementation
of k-means.
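
If a fixed number of clusters is acceptable, k-means avoids the pairwise matrix entirely, since it only needs the points and the k centroids in memory. The sketch below assumes that scikit-learn's MiniBatchKMeans, which updates the centroids from small random batches, is a reasonable candidate here. That choice, the synthetic data and the parameter values are assumptions for illustration, not details from this thread. Unlike DBSCAN, k-means needs the number of clusters up front and has no notion of noise points.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for the (x, y) coordinates: two well-separated blobs.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(5000, 2)),
    rng.normal(loc=(50.0, 50.0), scale=0.5, size=(5000, 2)),
])

# Each update step looks at one batch of 1024 points, so the working
# memory stays small no matter how many points there are in total.
model = MiniBatchKMeans(
    n_clusters=2, batch_size=1024, n_init=3, random_state=0
)
labels = model.fit_predict(points)
centers = model.cluster_centers_  # one (x, y) centroid per cluster
```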

Best,

Emanuele

On 04/05/2012 11:10 PM, Abhishek Pratap wrote:
> Hey Guys
>
> I am re-posting a message I sent to the numpy mailing list earlier. In
> summary, I need help with clustering. My input dataset is about 1-2
> million (x, y) coordinates, which I would like to cluster, for example
> using the DBSCAN algorithm. I tried it on a small data set and it works
> fine. When I increase my input size it crashes. Can I be more efficient?
> More details are copied below.
>
> Thanks!
> -Abhi
>
>
> ===message from numpy mailing list====
>
>
> I am new to Python and even more so to numpy. I am trying to cluster
> close to 900K points using the DBSCAN algorithm. My input is a list of
> ~900K tuples, each holding the (x, y) coordinates of one point. I am
> converting them to a numpy array and passing it to the pdist function of
> scipy.spatial.distance to calculate the distance between each pair of
> points.
>
> Here is some size info on my numpy array
> shape of input array  : (828575, 2)
> Size :  6872000 bytes
>
> I think the error has something to do with the default double dtype of
> the numpy array in the pdist function. I would appreciate it if you
> could help me debug this. I am sure I am overlooking something naive here.
>
> See the traceback below.
>
>
> MemoryError                               Traceback (most recent call last)
> /house/homedirs/a/apratap/Dropbox/dev/ipython/<ipython-input-83-ee29361b7276>
> in<module>()
>      36
>      37 print cleaned_senseBam
> --->  38 cluster_pet_points_per_chromosome(sense_bamFile)
>
> /house/homedirs/a/apratap/Dropbox/dev/ipython/<ipython-input-83-ee29361b7276>
> in cluster_pet_points_per_chromosome(bamFile)
>      30             print 'Size of list points is %d' % sys.getsizeof(points)
>      31             print 'Size of numpy array is %d' %
> sys.getsizeof(points_array)
> --->  32             cluster_points_DBSCAN(points_array)
>      33             #print points_array
>
>      34
>
> /house/homedirs/a/apratap/Dropbox/dev/ipython/<ipython-input-72-77005d7cd900>
> in cluster_points_DBSCAN(data_numpy_array)
>       9 def cluster_points_DBSCAN(data_numpy_array):
>      10     #eucledian distance calculation
>
> --->  11     D = distance.pdist(data_numpy_array)
>      12     S = distance.squareform(D)
>      13     H = 1 - S/np.max(S)
>
> /house/homedirs/a/apratap/playground/software/epd-7.2-2-rh5-x86_64/lib/python2.7/site-packages/scipy/spatial/distance.pyc
> in pdist(X, metric, p, w, V, VI)
>    1155
>    1156     m, n = s
> ->  1157     dm = np.zeros((m * (m - 1) / 2,), dtype=np.double)
>    1158
>    1159     wmink_names = ['wminkowski', 'wmi', 'wm', 'wpnorm']


