[SciPy-Dev] scipy.spatial.distance.pdist - unnecessarily limited functionality and a suggestion of a possible solution

Tomasz J. Kotarba tomasz@kotarba....
Sat Aug 27 07:52:09 CDT 2011

I would like to suggest a small improvement to pdist().  I have just
implemented some new similarity/distance metrics which work with
heterogeneous arrays containing numerical, categorical and set-valued
data, and tried to use pdist() to calculate the condensed distance
matrix by passing the new metric functions as intended (i.e. via the
'metric' argument).  Unfortunately, I ended up having to implement my
own pdist function, because the original pdist is unnecessarily
limited to numerical data by a conversion attempt in line 1089 (shown
below):

1088   # The C code doesn't do striding.
1089   [X] = _copy_arrays_if_base_present([_convert_to_double(X)])

I understand the need for that conversion, but I propose that it
either be moved down (so that it is applied only for the distance
measures which require it) or be made possible to switch off (e.g.
with a new function parameter whose default value preserves the
current behaviour, so that the change does not break backward
compatibility).  The rationale is that pdist() itself does not need
to access the individual values stored in the array; that is the job
of the metric function.  Decoupling pdist from the data would
therefore improve its design and increase the number of its potential
usage scenarios.
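
A minimal sketch of the second option (a switch with a
backward-compatible default) might look like the code below.  This is
not SciPy's implementation: pdist_generic, its convert_to_double
parameter and the toy mixed_distance metric are hypothetical names
made up for illustration, and the conversion is approximated here
with a plain np.asarray call rather than the internal helpers quoted
above.

```python
from itertools import combinations

import numpy as np


def pdist_generic(X, metric, convert_to_double=True):
    """Condensed pairwise distances; only the metric inspects the records.

    Hypothetical helper, not part of SciPy. The default keeps a numeric
    conversion step so that existing numeric callers see no change.
    """
    if convert_to_double:
        # Stand-in for the conversion discussed above (an assumption).
        X = np.asarray(X, dtype=np.double)
    n = len(X)
    # One entry per unordered pair (i, j) with i < j, in row-major order.
    return np.array(
        [metric(X[i], X[j]) for i, j in combinations(range(n), 2)],
        dtype=np.double,
    )


def mixed_distance(u, v):
    """Toy metric over (number, category, set) records; illustrative only."""
    num = abs(u[0] - v[0])                    # numerical part
    cat = 0.0 if u[1] == v[1] else 1.0        # categorical mismatch
    union = u[2] | v[2]
    jac = 1.0 - len(u[2] & v[2]) / len(union) if union else 0.0  # Jaccard
    return num + cat + jac


records = [
    (1.0, "red", {"a", "b"}),
    (2.0, "red", {"b"}),
    (4.0, "blue", {"c"}),
]

# Switching the conversion off leaves the heterogeneous records untouched.
d = pdist_generic(records, mixed_distance, convert_to_double=False)
```

The result is ordered like a condensed distance matrix, so it should
be usable with scipy.spatial.distance.squareform() if the square form
is needed.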

