[SciPy-Dev] Jaccard distance?
Jake Vanderplas
vanderplas@astro.washington....
Fri Dec 7 14:06:07 CST 2012
Hi,
Jaccard distance is a dissimilarity metric between two sets. I think
the confusion here is how the sets are specified. The definition of
Jaccard distance is:
J(A, B) = 1 - |A intersect B| / |A union B|
so if A = {1, 2, 3, 4} and B = {1, 2, 4, 3},
then J(A, B) = 0. Recall that order does not matter for sets.
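To make the set definition concrete, here is a minimal pure-Python
sketch. The function name jaccard_set_distance is hypothetical (it is
not a SciPy function), and returning 0 for two empty sets is an
assumption made only to avoid dividing by zero.

```python
def jaccard_set_distance(a, b):
    """Jaccard distance between two finite sets: 1 - |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# The example above: same elements, different order of listing.
print(jaccard_set_distance({1, 2, 3, 4}, {1, 2, 4, 3}))  # 0.0
```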
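Sets can also be encoded as an ordered list of binary variables, where
entry i is 1 if i is in the set and 0 otherwise. The above sets
(zero-indexed) could be represented by A = [0, 1, 1, 1, 1];
B = [0, 1, 1, 1, 1]. In this case, order matters, and the distance can
be specified by
J_bin(A, B) = N_unequal(A, B) / N_nonzero(A, B)
where N_unequal counts the positions at which A and B differ and
N_nonzero counts the positions at which at least one of them is nonzero.
We recover J_bin(A, B) = 0 as above.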
As a more complicated example, if A = [1, 0, 0, 1] and B = [1, 0, 1, 0],
then
N_unequal(A, B) = 2
N_nonzero(A, B) = 3 (only a single index has zero for both A and B)
and
J_bin(A, B) = 2/3
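The binary form can be sketched the same way. Both helper names
(to_indicator, jaccard_binary_distance) are hypothetical, and the
value returned when both vectors are all zero is an assumption. This
follows the formula as written in this message and is not meant to
mirror any library's implementation.

```python
def to_indicator(s, size):
    """Encode a set of non-negative integers as a 0/1 list of length size."""
    return [1 if i in s else 0 for i in range(size)]


def jaccard_binary_distance(u, v):
    """N_unequal / N_nonzero for two equal-length 0/1 sequences."""
    if len(u) != len(v):
        raise ValueError("inputs must have the same length")
    # Positions where the two entries differ.
    n_unequal = sum(1 for x, y in zip(u, v) if x != y)
    # Positions where at least one entry is nonzero.
    n_nonzero = sum(1 for x, y in zip(u, v) if x != 0 or y != 0)
    if n_nonzero == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 0.0
    return n_unequal / n_nonzero


A = to_indicator({1, 2, 3, 4}, 5)   # [0, 1, 1, 1, 1]
B = to_indicator({1, 2, 4, 3}, 5)   # [0, 1, 1, 1, 1]
print(jaccard_binary_distance(A, B))                        # 0.0
print(jaccard_binary_distance([1, 0, 0, 1], [1, 0, 1, 0]))  # 2/3
```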
Where things get a bit messy is that SciPy and Octave extend this binary
notion of the Jaccard distance to arbitrary numbers, applying the same
two counts to non-binary entries. So if A = [1, 2, 3, 4] and
B = [1, 2, 4, 3], then
N_unequal(A, B) = 2
N_nonzero(A, B) = 4
and
J_ext(A, B) = 1/2
I'm not sure whether this results in a true metric - that would be
interesting to figure out.
This seems to be the issue in ticket 1774: the user expected the metric
to operate as if A and B are unordered sets, while the function actually
operates as if they're ordered lists of (extended) binary variables.
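The two readings can be compared side by side on the same input. This
is only a sketch of the formulas as described in this message: the
function names are hypothetical, the handling of empty or all-zero
input is an assumption, and the actual library behaviour may differ
between versions.

```python
def jaccard_extended_distance(u, v):
    """Ordered-list reading: N_unequal / N_nonzero on arbitrary numbers."""
    if len(u) != len(v):
        raise ValueError("inputs must have the same length")
    n_unequal = sum(1 for x, y in zip(u, v) if x != y)
    n_nonzero = sum(1 for x, y in zip(u, v) if x != 0 or y != 0)
    # Assumption: all-zero input gives distance 0.
    return n_unequal / n_nonzero if n_nonzero else 0.0


def jaccard_set_distance(a, b):
    """Unordered-set reading: 1 - |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty sets give distance 0.
    return 1.0 - len(a & b) / len(union) if union else 0.0


A = [1, 2, 3, 4]
B = [1, 2, 4, 3]
print(jaccard_extended_distance(A, B))  # 0.5 (positions 2 and 3 differ)
print(jaccard_set_distance(A, B))       # 0.0 (same elements as sets)
```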
Hope that helps,
Jake
On 12/07/2012 11:36 AM, Pauli Virtanen wrote:
> Hi,
>
> Does anyone know what the "correct" definition of the Jaccard
> distance is?
>
> There's a bug report that claims that the current behavior is wrong:
>
> http://projects.scipy.org/scipy/ticket/1774
>
> However, as far as I see, the result is exactly what the docstring says
> and our result agrees with Octave.
>