[SciPy-Dev] Jaccard distance?

Jake Vanderplas vanderplas@astro.washington....
Fri Dec 7 14:06:07 CST 2012

```Hi,
Jaccard distance is a dissimilarity metric between two sets.  I think
the confusion here is how the sets are specified.  The definition of
Jaccard distance is:

J(A, B) = 1 -  |A intersect B| / |A union B|

so if A = {1, 2, 3, 4} and B = {1, 2, 4, 3},
then J(A, B) = 0.  Recall that order does not matter for sets.

Sets can also be encoded as an ordered list of binary variables: the
above sets (zero-indexed) could
be represented by A = [0, 1, 1, 1, 1]; B = [0, 1, 1, 1, 1].  In this
case, order matters, and the distance can be specified by

J_bin(A, B) = N_unequal(A, B) / N_nonzero(A, B)

and we recover J_bin(A, B) = 0 as above.

As a more complicated example, if A = [1, 0, 0, 1] and B = [1, 0, 1, 0],
then
N_unequal(A, B) = 2
N_nonzero(A, B) = 3 (only a single index has zero for both A and B)
and
J_bin(A, B) = 2/3

Where things get a bit messy is that scipy & octave extend this binary
notion of the Jaccard distance to arbitrary numbers. So if A = [1, 2, 3,
4] and B = [1, 2, 4, 3], then
N_unequal(A, B) = 2
N_nonzero(A, B) = 4
and
J_ext(A, B) = 1/2

I'm not sure whether this results in a true metric - that would be
interesting to figure out.

This seems to be the issue in ticket 1774: the user expected the metric
to operate as if A and B are unordered sets, while the function actually
operates as if they're ordered lists of (extended) binary variables.

Hope that helps,
Jake

On 12/07/2012 11:36 AM, Pauli Virtanen wrote:
> Hi,
>
> Does someone know about what is the "correct" definition for the Jaccard
> distance?
>
> There's a bug report that claims that the current behavior is wrong:
>
>      http://projects.scipy.org/scipy/ticket/1774
>
> However, as far as I see, the result is exactly what the docstring says
> and our result agrees with Octave.
>

```