[SciPy-Dev] boolean / real-value distance metrics
Fri Jan 6 03:35:02 CST 2012
On Fri, Jan 6, 2012 at 12:37 AM, Jacob VanderPlas wrote:
> Hi all,
> I've been taking a closer look at the various metrics in
> scipy.spatial.distance. In particular, every metric designed for
> boolean values behaves differently depending on whether the function is
> used directly, or cdist/pdist is used (see the example below).
> cdist/pdist first converts the float array to bool, then performs the
> computation. The calls to the metric functions work directly with the
> floating point vectors and yield a different result.
> I've poked around, and haven't found any documentation anywhere that
> addresses this:
> Is this a feature of scipy, or a bug? Which behavior is correct in this case?
> Are these boolean metrics, when generalized to floating point, true
> metrics? That is, can it be shown that they satisfy the triangle inequality?
> I'd like to work on the documentation to make all of this more clear,
> but I don't know where to start... Thanks
> Example code:
> In : from scipy.spatial.distance import cdist, yule
> In : import numpy as np
> In : np.random.seed(0)
> In : x = np.random.random(100)
> In : x[x>0.5] = 0 # set ~half the entries to zero
> In : y = np.random.random(100)
> In : y[y>0.5] = 0 # set half of entries to zero
> In : yule(x, y) # direct computation: this does not convert to bool
> Out: 0.96988390020367443
> In : cdist([x], [y], 'yule')[0, 0] # cdist computation: this does convert to bool
> Out: 0.83211678832116787
The boolean dissimilarity functions (such as yule) expect either boolean
arrays or numeric arrays of 0 and 1. They are not meant to be generalized
to arrays of arbitrary floating point values. This is not documented (as
far as I can tell), but it can be inferred from, for example, the
_nbool_correspond_ft_tf function, which is used by some of the
dissimilarity functions:
def _nbool_correspond_ft_tf(u, v):
    if u.dtype == np.int or u.dtype == np.float_ or u.dtype == np.double:
        not_u = 1.0 - u
        not_v = 1.0 - v
        nft = (not_u * v).sum()
        ntf = (u * not_v).sum()
    else:
        not_u = ~u
        not_v = ~v
        nft = (not_u & v).sum()
        ntf = (u & not_v).sum()
    return (nft, ntf)
Note that for a floating point array, not_u is computed as 1.0 - u.
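To make the consequence concrete, here is a small sketch of the two branches side by side. The helper names nft_ntf_arithmetic and nft_ntf_boolean are made up for illustration and are not part of scipy; they only mirror the arithmetic and the bitwise branch of the function quoted above.

```python
import numpy as np

def nft_ntf_arithmetic(u, v):
    # Mirrors the numeric branch: "not" is 1.0 - x, "and" is multiplication.
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return ((1.0 - u) * v).sum(), (u * (1.0 - v)).sum()

def nft_ntf_boolean(u, v):
    # Mirrors the boolean branch after casting: any nonzero entry is True.
    u = np.asarray(u).astype(bool)
    v = np.asarray(v).astype(bool)
    return (~u & v).sum(), (u & ~v).sum()

# Arrays holding only 0 and 1: both branches give the same counts.
u01 = np.array([0.0, 1.0, 1.0, 0.0])
v01 = np.array([1.0, 1.0, 0.0, 0.0])
print(nft_ntf_arithmetic(u01, v01), nft_ntf_boolean(u01, v01))

# Arbitrary floats with the same zero pattern: the branches disagree,
# because 1.0 - 0.25 is 0.75 rather than a logical "not".
u = np.array([0.0, 0.25, 0.5, 0.0])
v = np.array([0.5, 0.25, 0.0, 0.0])
print(nft_ntf_arithmetic(u, v), nft_ntf_boolean(u, v))
```

With 0/1 input the arithmetic really is counting, so the two agree. With general floats it computes weighted sums that no longer correspond to counts of mismatched entries.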
Any improvement of the documentation would certainly be welcome!
Likewise for the code: that test for the dtype of u misses many of the
numeric data types, and the check for np.float_ and np.double is
redundant, since these are both just different names for np.float64.
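As a rough illustration of the dtype problem, the sketch below compares a check in the style of the one above with a broader one based on np.issubdtype. Both helper names are hypothetical, and the broader check is only one possible approach, not a proposal for what scipy should adopt.

```python
import numpy as np

# np.double is just another name for np.float64.
print(np.double is np.float64)

def old_style_check(arr):
    # Hypothetical stand-in for the quoted test: it matches only the
    # default integer dtype and float64.
    dt = np.asarray(arr).dtype
    return dt == np.dtype(int) or dt == np.float64

def is_numeric_not_bool(arr):
    # Hypothetical broader test: any numeric dtype (ints of every width,
    # unsigned ints, floats, complex), but not bool.
    return np.issubdtype(np.asarray(arr).dtype, np.number)

for dtype in (np.float64, np.float32, np.int8, np.uint16, bool):
    a = np.zeros(3, dtype=dtype)
    print(np.dtype(dtype).name, old_style_check(a), is_numeric_not_bool(a))
```

The loop shows that float32, int8 and uint16 arrays slip past the narrow check and would be sent down the boolean branch, where ~ and & act bitwise on integers and ~ is not defined for floats at all.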
The separate dissimilarity functions such as yule are implemented
in python, while cdist is a wrapper for C code. The C functions require
a specific data type for their arrays, which is (presumably) why cdist
converts to boolean first. Instead of having a separate calculation for
bool and non-bool arrays, perhaps the dissimilarity functions should
do the same as cdist and simply convert non-bool arrays to boolean.
This would make them consistent with cdist.
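A minimal sketch of what "convert to boolean first" could look like for yule follows. The function name yule_as_bool is hypothetical, and the formula used, 2*ntf*nft / (ntt*nff + ntf*nft), is my reading of the Python implementation of yule at the time; treat both as assumptions rather than as the actual scipy code.

```python
import numpy as np

def yule_as_bool(u, v):
    # Hypothetical variant: cast to bool up front (any nonzero entry
    # becomes True), then count the four agreement/disagreement cases.
    u = np.asarray(u).astype(bool)
    v = np.asarray(v).astype(bool)
    ntt = int((u & v).sum())
    ntf = int((u & ~v).sum())
    nft = int((~u & v).sum())
    nff = int((~u & ~v).sum())
    half_r = ntf * nft
    if half_r == 0:
        # No disagreement in at least one direction; this also avoids
        # a 0/0 when the denominator vanishes.
        return 0.0
    return 2.0 * half_r / (ntt * nff + half_r)

# Float input and its boolean mask now give the same value.
u = np.array([0.0, 0.25, 0.5, 0.0])
v = np.array([0.5, 0.25, 0.0, 0.0])
print(yule_as_bool(u, v), yule_as_bool(u != 0, v != 0))
```

The trade-off is that the magnitudes of the float entries are discarded entirely. That seems acceptable here, since the boolean dissimilarities are only meant for 0/1 data in the first place.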