[SciPy-Dev] Faster implementation of cluster.hierarchy

Ralf Gommers ralf.gommers@googlemail....
Wed Oct 12 13:30:35 CDT 2011

On Wed, Oct 12, 2011 at 4:17 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:

> On Wed, Oct 12, 2011 at 5:12 AM, Conrad Lee <conradlee@gmail.com> wrote:
>> A mathematician at Stanford named Daniel Müllner recently came up with a
>> package that implements the hierarchical clustering methods found in
>> scipy.cluster.hierarchy.  His implementation is in C++, but includes a
>> python API that uses the same interface as scipy.cluster.hierarchy.
>> Müllner has posted benchmarks as well as algorithmic explanations of why
>> his implementation is faster in a paper on arXiv <http://arxiv.org/abs/1109.2378>.
>> He also has a webpage that describes the package here:
>> <http://math.stanford.edu/%7Emuellner/fastcluster.html>.
>> Because the results of the benchmarks look good, I am interested in
>> getting the scikit-learn package to use this implementation for the
>> hierarchical clustering provided by that package.  Rather than integrate the
>> code in scikit-learn, it seems more appropriate to integrate it upstream in
>> scipy.cluster.hierarchy.  Is there anyone who is interested in this
>> integration?  I am inexperienced with integrating C++ code and python code,
>> and also with how things work in the scipy project, so I'm not sure how to
>> proceed.
>> Note: Although Müllner's code is currently under a GPL license, he has
>> stated to me in e-mail that he would be willing to put it under the BSD-2
>> license if somebody put in the time to integrate it into scipy.
> Not my area, but I think it is a good thing to encourage such
> contributions.

Agreed, if you mean state-of-the-art algorithms: a speedup of 2 to 3 orders
of magnitude would be very nice to have.

> If the new code preserves the interface, comes with tests and
> documentation, and performs better, then I am all in favor of getting it in.
> I believe there is already a fair amount of C++ in scipy, so that shouldn't
> be a problem and there are probably folks who can give you advice on how to
> proceed.

Not sure what you consider a fair amount, but it's basically one file in
interpolate and sparse.sparsetools. Plus weave of course, but that's
unmaintained. The sparsetools code is a pain: it takes roughly as much time
to compile as the rest of scipy combined on my machine. Combine that with
the few people who know C++ well, and it leads me to think that the bar for
adding C++ code should be set high.
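
For concreteness, here is a minimal sketch of what "preserves the interface"
could mean in practice. It uses only scipy.cluster.hierarchy.linkage and
scipy.spatial.distance.pdist as they exist in scipy; the helper
agrees_with_scipy and the idea of passing in a candidate function are
assumptions made for illustration, not something taken from Müllner's
package or from scipy's test suite.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Four points on a line; the two tight pairs should merge first.
points = np.array([[0.0], [1.0], [10.0], [12.0]])

# Condensed pairwise distance vector, one of the inputs linkage() accepts.
distances = pdist(points)

# Z has one row per merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster]
Z = linkage(distances, method="single")


def agrees_with_scipy(candidate_linkage, condensed, method="single"):
    """Hypothetical helper: True if candidate_linkage reproduces scipy's Z.

    candidate_linkage is assumed to accept the same arguments as
    scipy.cluster.hierarchy.linkage and to return the same kind of array.
    """
    expected = linkage(condensed, method=method)
    actual = candidate_linkage(condensed, method=method)
    return bool(np.allclose(expected, actual))


# Sanity check of the helper itself, using scipy as its own candidate.
print(Z)
print(agrees_with_scipy(linkage, distances))
```

A replacement implementation would be passed in place of linkage in the last
call. Note that when several merges happen at exactly the same distance, two
correct implementations may order them differently, so a real test would
probably need to avoid ties or compare the resulting clusterings rather than
the raw rows.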
