[SciPy-User] Segmentation fault in scipy linkage function with large data set

Uri Laserson laserson@mit....
Tue Nov 15 14:19:48 CST 2011


Hi all,

I am trying to cluster a data set with almost 50,000 objects using
hierarchical clustering.

I generate a distance matrix like so:

Y = pdist( unique_seqs, vdj.clusteringcore.levenshtein )

and then try to perform the linkage like so:

Z = sp.cluster.hierarchy.linkage(Y,method=linkage)

The distance matrix is computed fine (albeit after 10 hours or so), and the
segfault occurs in the `linkage` function.
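
For a sense of scale, here is a rough sketch of how large the condensed
distance matrix from `pdist` is for an input of this size. The value
n = 50000 is an assumption (the count above is only "almost 50,000"), the
helper name `condensed_size` is made up for illustration, and the estimate
assumes 8-byte floats. The last check only compares n*n against the signed
32-bit integer limit; whether that has anything to do with the segfault is
not established here.

```python
def condensed_size(n):
    """Number of pairwise distances in a condensed matrix for n observations."""
    return n * (n - 1) // 2

n = 50000                      # assumed; the real input is "almost 50,000"
bytes_per_float = 8            # assumed float64 entries

m = condensed_size(n)
bytes_needed = m * bytes_per_float
gib_needed = bytes_needed / 2.0**30

# Observation only: n*n is larger than the biggest signed 32-bit integer.
int32_max = 2**31 - 1
square_exceeds_int32 = n * n > int32_max

print("entries in Y: %d" % m)
print("approx. memory for Y: %.2f GiB" % gib_needed)
print("n*n exceeds int32 max: %s" % square_exceeds_int32)
```

Note that this counts only Y itself, not any working memory that `linkage`
may allocate on top of it.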

However, I run the same script on many other inputs that are smaller, and
it finishes successfully.  This one largest input is giving me problems.
You can see the memory usage as a function of input size here:

https://picasaweb.google.com/lh/photo/KjPHcosMKxrehK22tslr4A?feat=directlink

and the CPU time here:

https://picasaweb.google.com/lh/photo/ygS_njM80Olja04pRRP2vw?feat=directlink

Each point is one execution of the script with a different set of input
sequences.  The vertical blue line shows the size of the current input,
which is causing the segfaults.

Does anyone have any ideas or suggestions as to what the problem is here?
When I searched for possible solutions, I found my own earlier report of
the same problem:

http://projects.scipy.org/scipy/ticket/967

However, in that case I was able to reduce the input size enough to avoid
the segfault.

I am running these on a large Linux cluster with Python 2.7.1,
numpy 1.5.0b1 and scipy 0.8.0.

According to the cluster administrators, the process did *not* make any
sudden large requests for resources that were unmet.

Debugging here is especially hard because it takes 10 hours to get to the
segfault...sigh...
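
One way to shorten that loop, sketched here under assumptions, is to write
the distance matrix to disk once and reload it for each attempt at the
linkage step. The random data, the default Euclidean metric, the
"average" method and the file name below are all stand-ins for the real
sequences, the Levenshtein metric, the actual method and a real path.

```python
import os
import tempfile

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Stand-in data: 20 random points instead of the real sequences.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)

# Expensive step: compute the condensed distance matrix once...
Y = pdist(X)

# ...and save it so later runs can skip straight to the linkage.
path = os.path.join(tempfile.mkdtemp(), "distances.npy")  # hypothetical location
np.save(path, Y)

# In a later debugging session: reload and retry only the linkage step.
Y_cached = np.load(path)
Z = linkage(Y_cached, method="average")
print(Z.shape)
```

With the matrix cached, each attempt costs only the load time plus the
linkage itself.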

Thanks!
Uri

.......................................................................................
Uri Laserson
Graduate Student | Biomedical Engineering | Church Lab
Harvard-MIT Division of Health Sciences and Technology
M +1 617 910 0447
laserson@mit.edu
http://web.mit.edu/laserson/www/