[SciPy-dev] Possible Error in Kendall's Tau (scipy.stats.stats.kendalltau)
Almer S. Tigelaar
almer@gnome....
Tue Mar 17 14:05:49 CDT 2009
Hello,
(I realize this mail is a bit lengthy, but I would appreciate it if someone could comment on it.)
I believe I have found a bug in your implementation of Kendall's Tau. I
ran the implementation to verify one I had written myself. When the
results turned out to be different, I investigated the current SciPy
implementation at the following URL:
http://svn.scipy.org/svn/scipy/trunk/scipy/stats/stats.py
(I am aware of the fact that there is also a Kendall's Tau implementation in mstats.py, but
have not evaluated that implementation yet).
I will give some explanation of my interpretation of Kendall's tau, an
example showing the differences between SciPy's and my implementation and a
possible fix for SciPy's implementation.
Your implementation is Kendall's tau-b with tie correction (same as
mine). As my reference definition I take the one in the following
'poster' paper:
http://portal.acm.org/citation.cfm?id=1277935
(this same definition appears in other places as well, this is the
shortest resource I could find)
Recall that Kendall's tau calculates a score t given two rankings R1 and
R2. The variables P, Q, T and U are all characteristics of the pairs in
those rankings.
The definition given in the reference is:
t = (P - Q) / SQRT((P + Q + T) * (P + Q + U))
where P is the number of concordant pairs, Q the number of discordant
pairs, T the number of ties in R1 and U the number of ties in R2.
An example:
-----------
Let's use two identical rankings with a tie:
      A  B  C
R1 = [1, 1, 2]
R2 = [1, 1, 2]
There are three pair combinations in these lists, namely: (A, B), (A, C)
and (B, C). It is obvious that _one_ of these combinations is tied in
both lists (the (A, B) combination, which is (1, 1) for both R1 and R2).
So, since there is one tie in each list, we have T = U = 1.
We find that there are two concordant pairs, (A, C) and (B, C), so
P = 2. There are no discordant pairs, so Q = 0. With all
variables given, we can now calculate Kendall's tau for R1 and R2:
t = (2 - 0) / SQRT((2 + 0 + 1)*(2 + 0 + 1))
t = 2 / SQRT(3*3)
t = 2 / 3
t = 0.6666666
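The hand calculation above can be checked with a short script. This is only a sketch of the definition as used in this mail: the function name pair_counts_tau is made up for illustration, and it counts a pair that is tied in both rankings toward both T and U, which is the interpretation taken here (other formulations of tau-b may count such pairs differently).

```python
from itertools import combinations
from math import sqrt


def pair_counts_tau(r1, r2):
    """Return (P, Q, T, U, t) using the definition quoted in this mail.

    Hypothetical helper for illustration; assumes at least one pair is
    untied in each ranking, otherwise the denominator is zero.
    """
    p = q = t_ties = u_ties = 0
    for j, k in combinations(range(len(r1)), 2):
        a1 = r1[j] - r1[k]  # zero means the pair is tied in R1
        a2 = r2[j] - r2[k]  # zero means the pair is tied in R2
        if a1 == 0:
            t_ties += 1
        if a2 == 0:
            u_ties += 1
        if a1 * a2 > 0:     # same ordering in both rankings: concordant
            p += 1
        elif a1 * a2 < 0:   # opposite ordering: discordant
            q += 1
    tau = (p - q) / sqrt((p + q + t_ties) * (p + q + u_ties))
    return p, q, t_ties, u_ties, tau


# P = 2, Q = 0, T = 1, U = 1, and t is about 0.6667
print(pair_counts_tau([1, 1, 2], [1, 1, 2]))
```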
However, using scipy (svn HEAD) as follows:
import scipy.stats.stats as s
s.kendalltau([1,1,2], [1,1,2])
Yields t = 1.0:
(1.0, 0.11718509694604401)
Which I believe is wrong (or at least: has no correction for ties, as is
claimed in the source code). If there are three combinations and one of
these is a tie, and the other two combinations are concordant, it makes
sense that Kendall's tau-b should yield 2 / 3.
The cause and fix
-----------------
Playing around with SciPy's code (and comparing it with my own) I believe I
discovered a probable cause for this difference in SciPy's code. Again, I used the
implementation at the following URL:
http://svn.scipy.org/svn/scipy/trunk/scipy/stats/stats.py
(please take a look at the implementation first, otherwise you will not
understand my explanation)
In the 'kendalltau(x,y)' function we see a test for ties and an 'else'
branch. In the 'else' branch the values of 'n1' and 'n2' are incremented
if there is a tie (conforming to +T and +U in the formula given above).
However, I believe that the 'if' conditions here are wrong:
Consider that if 'a1' has value '0', the pair is tied in the first list
(the same goes for 'a2' and the second list). In the else branch I see:
if a1:
    n1 = n1 + 1
if a2:
    n2 = n2 + 1
So, here the addition takes place on the variables (n1, n2) if there is
NO tie, instead of if there is a tie. This explains the different
outcome. Translating this back to the formula gives me T = U = 0, which
would yield:
t = (2 - 0) / SQRT((2 + 0 + 0)*(2 + 0 + 0))
t = 2 / SQRT(2*2)
t = 2 / 2
t = 1.0
Which is indeed consistent with the SciPy outcome. Hence, I believe
the solution is to correct the conditions in the if statements in
the Kendall's tau function:
if not a1:
    n1 = n1 + 1
if not a2:
    n2 = n2 + 1
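To see the effect of the two conditions side by side, here is a small stand-alone sketch. It is a reconstruction from the description in this mail, not a copy of the SciPy source: the names a1, a2, n1 and n2 follow the mail, while tau_loop, count_ties, aa and iss are hypothetical names chosen for the example, and the p-value computation is left out entirely.

```python
from math import sqrt


def tau_loop(x, y, count_ties):
    """Pairwise loop with the structure described in this mail (assumed).

    count_ties=False mimics the 'if a1:' / 'if a2:' conditions;
    count_ties=True mimics the proposed 'if not a1:' / 'if not a2:'.
    """
    n1 = n2 = iss = 0
    for j in range(len(x) - 1):
        for k in range(j + 1, len(x)):
            a1 = x[j] - x[k]
            a2 = y[j] - y[k]
            aa = a1 * a2
            if aa:
                # No tie in either list: the pair counts in both factors
                # and is either concordant (+1) or discordant (-1).
                n1 += 1
                n2 += 1
                iss += 1 if aa > 0 else -1
            else:
                # At least one list is tied for this pair.
                bump1 = (a1 == 0) if count_ties else (a1 != 0)
                bump2 = (a2 == 0) if count_ties else (a2 != 0)
                if bump1:
                    n1 += 1
                if bump2:
                    n2 += 1
    return iss / sqrt(n1 * n2)


print(tau_loop([1, 1, 2], [1, 1, 2], count_ties=False))  # 1.0
print(tau_loop([1, 1, 2], [1, 1, 2], count_ties=True))   # about 0.6667
```

With count_ties=False the tied (A, B) pair contributes to neither n1 nor n2, which corresponds to T = U = 0 in the formula; with count_ties=True it contributes to both, which corresponds to T = U = 1.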
Closing
-------
Of course, my interpretation of Kendall's Tau could be wrong. Since I
cannot exclude that possibility, I would appreciate it if one of you could
check and see if you reach the same conclusion. Maybe the base formula that
SciPy uses is different.
I have also compared your implementation to the one in the R
project. However, their source code suggests that they do not adjust for
ties (effectively implementing Kendall's tau-a).
--
With kind regards,
Almer S. Tigelaar
University of Twente