[SciPy-Dev] boolean / real-value distance metrics
Warren Weckesser
warren.weckesser@enthought....
Fri Jan 6 03:35:02 CST 2012
On Fri, Jan 6, 2012 at 12:37 AM, Jacob VanderPlas <
vanderplas@astro.washington.edu> wrote:
> Hi all,
> I've been taking a closer look at the various metrics in
> scipy.spatial.distance. In particular, every metric designed for
> boolean values behaves differently depending on whether the function is
> used directly, or cdist/pdist is used (see the example below).
> cdist/pdist first converts the float array to bool, then performs the
> computation. The calls to the metric functions work directly with the
> floating point vectors and yield a different result.
>
> I've poked around, and haven't found any documentation anywhere that
> addresses this:
> Is this a feature of scipy, or a bug? Which behavior is correct in this
> case?
> Are these boolean metrics, when generalized to floating point, true
> metrics? That is, can it be shown that they satisfy the triangle inequality?
>
> I'd like to work on the documentation to make all of this more clear,
> but I don't know where to start... Thanks
> Jake
>
> Example code:
>
> In [1]: from scipy.spatial.distance import cdist, yule
>
> In [2]: import numpy as np
>
> In [3]: np.random.seed(0)
>
> In [4]: x = np.random.random(100)
>
> In [5]: x[x>0.5] = 0 # set ~half the entries to zero
>
> In [6]: y = np.random.random(100)
>
> In [7]: y[y>0.5] = 0 # set half of entries to zero
>
> In [8]: yule(x, y) # direct computation: this does not convert to bool
> Out[8]: 0.96988390020367443
>
> In [9]: cdist([x], [y], 'yule')[0, 0] # cdist computation: this does convert to bool
> Out[9]: 0.83211678832116787
>
>
The boolean dissimilarity functions (such as yule) expect either boolean
arrays or numeric arrays of 0 and 1. They are not meant to be generalized
to arrays of arbitrary floating point values. This is not documented (as
far as I can tell), but it can be inferred from, for example, the
_nbool_correspond_ft_tf function, which is used by some of the
dissimilarity functions:
def _nbool_correspond_ft_tf(u, v):
    if u.dtype == np.int or u.dtype == np.float_ or u.dtype == np.double:
        not_u = 1.0 - u
        not_v = 1.0 - v
        nft = (not_u * v).sum()
        ntf = (u * not_v).sum()
    else:
        not_u = ~u
        not_v = ~v
        nft = (not_u & v).sum()
        ntf = (u & not_v).sum()
    return (nft, ntf)
Note that for a floating point array, not_u is computed as 1.0 - u.
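
To make the difference concrete, here is a minimal sketch (not the SciPy
source) that reimplements the two branches of the helper as separate
functions and applies them to a small array that is not restricted to 0
and 1. The function names and the sample values are made up for
illustration.

```python
import numpy as np

def nbool_correspond_ft_tf_float(u, v):
    # Float branch of the helper quoted above: "not x" is computed as 1.0 - x.
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    nft = ((1.0 - u) * v).sum()
    ntf = (u * (1.0 - v)).sum()
    return nft, ntf

def nbool_correspond_ft_tf_bool(u, v):
    # Boolean branch: every nonzero entry is treated as True first.
    u = np.asarray(u).astype(bool)
    v = np.asarray(v).astype(bool)
    nft = (~u & v).sum()
    ntf = (u & ~v).sum()
    return nft, ntf

# Illustrative data: floats in [0, 1] that are not all 0 or 1.
u = np.array([0.0, 0.25, 0.0, 0.5])
v = np.array([0.5, 0.0, 0.0, 0.5])

print(nbool_correspond_ft_tf_float(u, v))  # fractional "counts"
print(nbool_correspond_ft_tf_bool(u, v))   # integer counts
```

With these inputs the float branch returns the fractional values 0.75 and
0.5, while the boolean branch counts exactly one mismatch in each
direction. The two only agree when the inputs contain nothing but 0 and 1.
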
Any improvement of the documentation would certainly be welcome!
Likewise for the code: that test for the dtype of u misses many of the
numeric data types, and the check for np.float_ and np.double is
redundant, since these are both just different names for np.float64.
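
As a rough sketch of a broader dtype test, assuming the goal is simply to
send every integer and floating point array down the arithmetic branch,
something like the following could be used. The helper name is
hypothetical and is not part of SciPy; np.issubdtype is the standard NumPy
call. It also avoids the aliases np.int and np.float_, which recent NumPy
releases no longer provide.

```python
import numpy as np

def takes_arithmetic_branch(a):
    # Hypothetical helper: True for any integer or floating point dtype,
    # False for bool (and for anything else, e.g. complex or object).
    dt = np.asarray(a).dtype
    return bool(np.issubdtype(dt, np.integer) or np.issubdtype(dt, np.floating))

# np.double is just another name for np.float64, so testing both is redundant.
print(np.double is np.float64)

for dtype in (np.float32, np.float64, np.int8, np.uint16, np.int64, np.bool_):
    print(np.dtype(dtype).name, takes_arithmetic_branch(np.zeros(3, dtype=dtype)))
```
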
The separate dissimilarity functions such as yule are implemented
in Python, while cdist is a wrapper for C code. The C functions require
a specific data type for their arrays, which is (presumably) why cdist
converts to boolean first. Instead of having a separate calculation for
bool and non-bool arrays, perhaps the dissimilarity functions should
do the same as cdist and simply convert non-bool arrays to boolean.
This would make them consistent with cdist.
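
A minimal sketch of that suggestion, assuming "convert to boolean" means
treating every nonzero entry as True. The function yule_as_bool is a
hypothetical illustration, not the SciPy implementation; it computes the
Yule dissimilarity R / (c_TT * c_FF + R / 2), with R = 2 * c_TF * c_FT,
from the boolean counts. Returning 0.0 when c_TF * c_FT is zero is a
choice made here to avoid dividing by zero.

```python
import numpy as np

def yule_as_bool(u, v):
    # Hypothetical helper: convert to bool first, so nonzero means True.
    u = np.asarray(u).astype(bool)
    v = np.asarray(v).astype(bool)
    ntt = int((u & v).sum())
    ntf = int((u & ~v).sum())
    nft = int((~u & v).sum())
    nff = int((~u & ~v).sum())
    half_r = ntf * nft
    if half_r == 0:
        # Assumption of this sketch: no disagreement in one direction
        # means a dissimilarity of zero (also avoids 0 / 0).
        return 0.0
    # R / (c_TT * c_FF + R / 2) with R = 2 * half_r
    return 2.0 * half_r / (ntt * nff + half_r)

# Illustrative data: the result depends only on which entries are nonzero.
u = np.array([0.0, 0.25, 0.0, 0.5])
v = np.array([0.5, 0.0, 0.0, 0.5])

print(yule_as_bool(u, v))
print(yule_as_bool(u != 0, v != 0))  # same value for the boolean masks
```

Because the arrays are converted up front, rescaling the nonzero entries
does not change the result, which is the behavior described above for
cdist.
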
Warren