[SciPy-dev] Possible Error in Kendall's Tau (scipy.stats.stats.kendalltau)

josef.pktd@gmai... josef.pktd@gmai...
Tue Mar 17 15:41:39 CDT 2009


On Tue, Mar 17, 2009 at 3:05 PM, Almer S. Tigelaar <almer@gnome.org> wrote:
> Hello,
>
> (I realize this mail is a bit lengthy, but I would appreciate it if someone could comment on it).
>
> I believe that I found a bug in your implementation of Kendall's Tau. I
> have evaluated the implementation (to verify a self-written
> implementation). When the results turned out to be different I
> investigated the current SciPy implementation at the following URL:
> http://svn.scipy.org/svn/scipy/trunk/scipy/stats/stats.py
>
> (I am aware of the fact that there is also a Kendall's Tau implementation in mstats.py, but
>  have not evaluated that implementation yet).
>
> I will give some explanation of my interpretation of Kendall's tau, an
> example showing the differences between SciPy's and my implementation and a
> possible fix for SciPy's implementation.
>
> Your implementation is Kendall's tau-b with tie correction (same as
> mine). I take as my reference definition, the one in the following
> 'poster' paper:
> http://portal.acm.org/citation.cfm?id=1277935
> (this same definition appears in other places as well, this is the
> shortest resource I could find)
>
> Recall that Kendall's tau calculates a score t given two rankings R1 and
> R2. Variables P, Q and T are all characteristics of the pairs in those
> rankings.
>
> The definition given in the reference is:
>        t = (P - Q) / SQRT((P + Q + T) * (P + Q + U))
> where P is the number of concordant pairs, Q the number of discordant
> pairs, T the number ties in R1 and U the number of ties in R2.
>
> An example:
> -----------
> Let's use two identical rankings with a tie:
>              A  B  C
>        R1 = [1, 1, 2]
>        R2 = [1, 1, 2]
>
> There are three pair combinations in these lists, namely: (A, B), (A, C)
> and (B, C). It is obvious that _one_ of these combinations has a tie for
> both lists (the (A,B) combination which is (1,1) for both R1 and R2).
> So, since there is one tie in both list we have T = U = 1
>
> We find that there are two concordant pairs in both lists (A, C) and
> (B,C) so P = 2. There are no discordant pairs, so Q = 0. With all
> variables given, we can now calculate Kendall's tau for R1 and R2:
>
>        t = (2 - 0) / SQRT((2 + 0 + 1)*(2 + 0 + 1))
>        t = 2 / SQRT(3*3)
>        t = 2 / 3
>        t = 0.6666666
>
> However, using scipy (svn HEAD) as follows:
>
>        import scipy.stats.stats as s
>        s.kendalltau([1,1,2], [1,1,2])
>
> Yields t = 1.0:
>
>        (1.0, 0.11718509694604401)
>
> Which I believe is wrong (or at least: has no correction for ties, as is
> claimed in the source code). If there are three combinations and one of
> these is a tie, and the other two combinations are concordant, it makes
> sense that Kendall's tau-b should yield 2 / 3.
>
> The cause and fix
> -----------------
> Playing around with SciPy's code (and comparing it with my own) I believe I
> discovered a probable cause for this difference in SciPy's code. Again, I used the
> implementation at the following URL:
> http://svn.scipy.org/svn/scipy/trunk/scipy/stats/stats.py
> (please take look at the implementation first, otherwise you will not
> understand my explanation)
>
> In the 'kendalltau(x,y)' function we see a test for ties and an 'else'
> branch. In the 'else' branch the values of 'n1' and 'n2' are incremented
> if there is a tie (conforming to +T and +U in the formula given above).
> However, I believe that the 'if' conditions here are wrong:
> 1) Consider that if 'a1' has value '0' it is tied (the same goes for
> 'a2'). In the else branch I see:
>
>        if a1:
>                n1 = n1 + 1
>        if a2:
>                n2 = n2 + 1
>
> So, here the addition takes places on the variables (n1, n2) if there is
> NO tie, instead of if there is a tie. Hence, this explains the different
> outcome. Translating this back to the formula gives me T = U = 0, which
> would yield:
>
>        t = (2 - 0) / SQRT((2 + 0 + 0)*(2 + 0 + 0))
>        t = 2 / SQRT(2*2)
>        t = 2 / 2
>        t = 1.0
>
> Which is indeed consistent with the SciPy outcome. Henceforth, I believe
> the solution to this is to correct the condition in the if statements in
> the Kendall's tau function:
>
>        if not a1:
>                n1 = n1 + 1
>        if not a2:
>                n2 = n2 + 1


Yes, comparing this part of the function with your reference, your
change is correct. I prefer an explicit comparison with 0, because
it's much easier to read than thinking about the truth value of a
number
        if a1 == 0:
                n1 = n1 + 1
        if a2 == 0:
                n2 = n2 + 1

This is already the third problem with tie handling, so I am not surprised.

Note, however that this only creates the incorrect count if both
vectors have a tie for exactly the same pair. If only one of the two
variables has a tie, then it increases n of the other variable, either
n1 if a2 ==0, or n2 if a1 ==0.

So, since only matching ties are not counted in the calculation of
kendalls tau, this might be up to interpretation whether two identical
vectors should have correlation 1 or something smaller, reduced by
matching tie correction.


>
> Closing
> -------
> Of course, my interpretation of Kendall's Tau could be wrong. Since I
> can not exclude that possibility I would appreciate it if one of you could
> check and see if you reach the same conclusion. Maybe the base formula that
> SciPy uses is different.
>
> I have compared your implementation also to that implemented in the R
> project, however their source code suggests that they do not adjust for
> ties (effectively implementing Kendall's tau-a).
>
> --
> With kind regards,
>
> Almer S. Tigelaar
> University of Twente
>

Thank you for checking and reporting this.

I still need to look at a it bit more closely, and find some correct
tests, but I will change it withing a few days.
The problem, I had with Kendalls tau was that I didn't find a good,
non-ambiguous reference, also with the hints to different versions of
kendalls tau, it wasn't clear to me what exactly is implemented or how
the different versions are defined.

Do you also know a reference for the variance that is used in
calculating the p-value, so we can also verify it?
I'm a bit surprised that R doesn't do the tie handling.

Josef


More information about the Scipy-dev mailing list