[SciPy-dev] Possible Error in Kendall's Tau (scipy.stats.stats.kendalltau)

Almer S. Tigelaar almer@gnome....
Tue Mar 17 14:05:49 CDT 2009


Hello,

(I realize this mail is a bit lengthy, but I would appreciate it if someone could comment on it).

I believe I have found a bug in your implementation of Kendall's Tau. I
used it to verify an implementation I wrote myself. When the results
turned out to be different, I investigated the current SciPy
implementation at the following URL:
http://svn.scipy.org/svn/scipy/trunk/scipy/stats/stats.py

(I am aware of the fact that there is also a Kendall's Tau implementation in mstats.py, but
 have not evaluated that implementation yet).

I will explain my interpretation of Kendall's tau, give an example
showing the difference between SciPy's implementation and mine, and
propose a possible fix for SciPy's implementation.

Your implementation is Kendall's tau-b with tie correction (same as
mine). As my reference definition I take the one in the following
'poster' paper:
http://portal.acm.org/citation.cfm?id=1277935
(the same definition appears in other places as well; this is the
shortest resource I could find)

Recall that Kendall's tau calculates a score t given two rankings R1 and
R2. Variables P, Q, T and U are all characteristics of the pairs in those
rankings.

The definition given in the reference is:
	t = (P - Q) / SQRT((P + Q + T) * (P + Q + U))
where P is the number of concordant pairs, Q the number of discordant
pairs, T the number of ties in R1 and U the number of ties in R2.
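
As a minimal sketch, the formula above can be written as a small Python
function. The name 'tau_from_counts' is hypothetical (it appears neither in
SciPy nor in the reference), and raising an error for a zero denominator is
an assumption of this sketch:

```python
import math

def tau_from_counts(p, q, t, u):
    """Evaluate t = (P - Q) / SQRT((P + Q + T) * (P + Q + U))."""
    denom = math.sqrt((p + q + t) * (p + q + u))
    if denom == 0:
        # Assumption of this sketch: treat a zero denominator as an error.
        raise ValueError("tau is undefined when either factor is zero")
    return (p - q) / denom

# Sanity checks without ties: the score reaches its extremes.
print(tau_from_counts(3, 0, 0, 0))  # every pair concordant
print(tau_from_counts(0, 3, 0, 0))  # every pair discordant
```

The two calls print 1.0 and -1.0 respectively.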

An example:
-----------
Let's use two identical rankings with a tie:
	      A  B  C
	R1 = [1, 1, 2]
	R2 = [1, 1, 2]

There are three pair combinations in these lists, namely (A, B), (A, C)
and (B, C). Exactly _one_ of these combinations is tied in both lists:
the (A, B) combination, which is (1, 1) for both R1 and R2.
So, since there is one tie in each list, we have T = U = 1.

We find that two pairs are concordant across both lists, (A, C) and
(B, C), so P = 2. There are no discordant pairs, so Q = 0. With all
variables given, we can now calculate Kendall's tau for R1 and R2:

	t = (2 - 0) / SQRT((2 + 0 + 1)*(2 + 0 + 1))
	t = 2 / SQRT(3*3)
	t = 2 / 3
	t = 0.6666666
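
The counting in this example can be reproduced with a short sketch. It
follows the reading used in this mail, where a pair tied in both rankings
counts towards both T and U; other definitions may count such pairs
differently. The helper name 'count_pairs' is hypothetical:

```python
from itertools import combinations
import math

def count_pairs(r1, r2):
    """Return (P, Q, T, U) under the reading used in this mail.

    A pair tied in R1 adds to T, a pair tied in R2 adds to U, and a pair
    tied in both rankings adds to both (an assumption of this sketch).
    """
    p = q = t = u = 0
    for i, j in combinations(range(len(r1)), 2):
        d1 = r1[i] - r1[j]
        d2 = r2[i] - r2[j]
        if d1 == 0:
            t += 1
        if d2 == 0:
            u += 1
        if d1 * d2 > 0:
            p += 1      # both rankings order the pair the same way
        elif d1 * d2 < 0:
            q += 1      # the rankings order the pair in opposite ways
    return p, q, t, u

p, q, t, u = count_pairs([1, 1, 2], [1, 1, 2])
tau = (p - q) / math.sqrt((p + q + t) * (p + q + u))
print((p, q, t, u), tau)
```

For the two rankings above this gives the counts (2, 0, 1, 1) and a score
of 2 / 3.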

However, using scipy (svn HEAD) as follows:

	import scipy.stats.stats as s
	s.kendalltau([1,1,2], [1,1,2])

Yields t = 1.0:

	(1.0, 0.11718509694604401)

Which I believe is wrong (or at least it applies no correction for ties,
although the source code claims one). If there are three combinations and
one of these is a tie, and the other two combinations are concordant, it
makes sense that Kendall's tau-b should yield 2 / 3.

The cause and fix
-----------------
Playing around with SciPy's code (and comparing it with my own) I believe I
discovered a probable cause for this difference. Again, I used the
implementation at the following URL:
http://svn.scipy.org/svn/scipy/trunk/scipy/stats/stats.py
(please take a look at the implementation first, otherwise you will not
understand my explanation)

In the 'kendalltau(x,y)' function we see a test for ties and an 'else'
branch. In the 'else' branch the values of 'n1' and 'n2' are incremented
if there is a tie (corresponding to +T and +U in the formula given above).
However, I believe that the 'if' conditions here are wrong.
Consider that a value of 0 for 'a1' means a tie (the same goes for
'a2'). In the else branch I see:

	if a1:
		n1 = n1 + 1
	if a2:
		n2 = n2 + 1

So, here the addition takes place on the variables (n1, n2) if there is
NO tie, instead of if there is a tie. This explains the different
outcome. Translating this back to the formula gives me T = U = 0, which
would yield:
	
	t = (2 - 0) / SQRT((2 + 0 + 0)*(2 + 0 + 0))
	t = 2 / SQRT(2*2)
	t = 2 / 2
	t = 1.0

Which is indeed consistent with the SciPy outcome. Hence, I believe
the solution is to correct the conditions in the if statements in
the Kendall's tau function:

	if not a1:
		n1 = n1 + 1
	if not a2:
		n2 = n2 + 1
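
To compare the two conditions side by side, here is a simplified sketch
modelled on the description in this mail. It is not the SciPy source: the
function name 'tau_sketch', the 'count_ties' switch, and the handling of
non-tied pairs in the 'if' branch are assumptions made for illustration.

```python
import math

def tau_sketch(x, y, count_ties):
    """Simplified loop based on this mail's description (not SciPy's code).

    count_ties=False uses 'if a1' / 'if a2' in the else branch;
    count_ties=True uses the proposed 'if not a1' / 'if not a2'.
    """
    n1 = n2 = 0
    score = 0  # concordant minus discordant pairs
    for j in range(len(x) - 1):
        for k in range(j + 1, len(x)):
            a1 = x[j] - x[k]
            a2 = y[j] - y[k]
            aa = a1 * a2
            if aa:
                # Assumed: a pair tied in neither list counts in both factors.
                n1 += 1
                n2 += 1
                score += 1 if aa > 0 else -1
            else:
                if (not a1) if count_ties else a1:
                    n1 += 1
                if (not a2) if count_ties else a2:
                    n2 += 1
    return score / math.sqrt(n1 * n2)

print(tau_sketch([1, 1, 2], [1, 1, 2], count_ties=False))
print(tau_sketch([1, 1, 2], [1, 1, 2], count_ties=True))
```

On the example rankings the first call gives 1.0 and the second gives
2 / 3, matching the two hand calculations in this mail. Only this one
example is exercised here.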

Closing
-------
Of course, my interpretation of Kendall's Tau could be wrong. Since I
cannot exclude that possibility, I would appreciate it if one of you could
check and see whether you reach the same conclusion. Maybe the base formula
that SciPy uses is different.

I have also compared your implementation to the one in the R project;
however, their source code suggests that they do not adjust for
ties (effectively implementing Kendall's tau-a).

-- 
With kind regards,

Almer S. Tigelaar
University of Twente


