[SciPy-Dev] scipy.spatial.distance.pdist - unnecessarily limited functionality and a suggestion of a possible solution

Ralf Gommers ralf.gommers@googlemail....
Mon Aug 29 12:11:04 CDT 2011


On Sat, Aug 27, 2011 at 2:52 PM, Tomasz J. Kotarba <tomasz@kotarba.net> wrote:

> Hello,
> I would like to suggest a small improvement to pdist().  I have just
> implemented some new similarity/distance metrics which work with
> heterogeneous arrays containing numerical, categorical and
> set-valued data and tried to use pdist() for calculating the condensed
> distance matrix by passing the new metric functions as intended (i.e.
> using the 'metric' argument).  Unfortunately, it turned out that I had
> to implement my own pdist function as the original pdist's
> functionality is unnecessarily limited to numerical data because of a
> conversion attempt in line 1089 (shown below):
>
> 1088   # The C code doesn't do striding.
> 1089   [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
>
> I understand the need for that conversion but propose that it either
> be moved down (to be applied only in case of distance measures which
> require it) or made possible to switch off (e.g. using a new function
> call parameter with a default value so that the change does not break
> backward compatibility).  The rationale behind this is that the
> intended purpose of pdist() does not require it to access individual
> values stored in the array - it is the job of a metric function - so
> decoupling pdist from the data would improve its design (and increase
> the number of its potential usage scenarios).
>

It looks like you could move the _convert_to_double call into each of these
if/elif cases:

        if metric == minkowski:
            def dfun(u, v):
                return minkowski(u, v, p)
        elif metric == wminkowski:
            def dfun(u, v):
                return wminkowski(u, v, p, w)
        elif metric == seuclidean:
            def dfun(u, v):
                return seuclidean(u, v, V)
        elif metric == mahalanobis:
            def dfun(u, v):
                return mahalanobis(u, v, V)

I don't think you can move _copy_arrays_if_base_present(). Could you put
together a patch including a test that covers your use case?
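
As a rough illustration of the behaviour such a test might pin down, here is a
minimal pure-Python sketch. It is not SciPy code: pdist_sketch, to_double and
mixed_metric are hypothetical names, to_double only stands in for
_convert_to_double, and the striding/copy step is left out entirely. The point
is just that the conversion happens inside the branch for the numeric metric,
while a user-supplied metric receives the rows untouched.

```python
from itertools import combinations


def minkowski(u, v, p=2):
    """Minkowski distance between two numeric sequences."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def to_double(X):
    """Hypothetical stand-in for _convert_to_double: force values to float."""
    return [[float(x) for x in row] for row in X]


def pdist_sketch(X, metric, p=2):
    """Condensed pairwise distances; converts X only for the numeric metric."""
    if metric is minkowski:
        X = to_double(X)  # conversion applied only in the branch that needs it

        def dfun(u, v):
            return minkowski(u, v, p)
    else:
        dfun = metric  # user-supplied metric sees the rows unconverted
    # Pair order matches a condensed matrix: (0,1), (0,2), ..., (n-2,n-1)
    return [dfun(u, v) for u, v in combinations(X, 2)]


def mixed_metric(u, v):
    """Toy metric on rows of (number, category, set); purely illustrative."""
    num = abs(u[0] - v[0])
    cat = 0.0 if u[1] == v[1] else 1.0
    union = u[2] | v[2]
    jac = 1.0 - len(u[2] & v[2]) / len(union) if union else 0.0
    return num + cat + jac


# Heterogeneous rows that a blanket float conversion would reject.
rows = [
    (1.0, "red", {"a", "b"}),
    (3.0, "red", {"b"}),
    (1.0, "blue", {"a", "b"}),
]
mixed = pdist_sketch(rows, mixed_metric)
numeric = pdist_sketch([[0, 0], [3, 4]], minkowski)
```

A real test would call pdist itself with an object array and a custom metric,
and check that the built-in metrics still give the same results as before.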

Ralf