[SciPy-dev] Possible Error in Kendall's Tau (scipy.stats.stats.kendalltau)

josef.pktd@gmai... josef.pktd@gmai...
Wed Mar 18 11:35:03 CDT 2009

On Wed, Mar 18, 2009 at 11:59 AM, Almer S. Tigelaar <almer@gnome.org> wrote:
> Hi Josef,
> On Wed, 2009-03-18 at 11:12 -0400, josef.pktd@gmail.com wrote:
>> I'm giving up, this takes too much time:
> Okay, we can simply conclude that there are some conflicting
> interpretations of Kendall's Tau-b. Both, I believe, are defensible. In
> such cases one is best off choosing just one approach and making clear
> what it is.
> So, I would simply say in the function documentation precisely the
> definition that you use (also for ties) and your motivation relating to
> the correlation interpretation (which is indeed pretty convincing).
> You can use my template Kendall tau-b definition from this post for
> that:
> http://mail.scipy.org/pipermail/scipy-dev/2009-March/011569.html
> Then at least there will be no misunderstanding about what the function
> is supposed to (and will) do. If people then disagree (and want to use
> the other interpretation), they can copy the function and adjust it to
> their wishes.
> Thanks for your rigorous testing and general effort on this. I
> appreciate it, very useful for what I am working on.

The proliferation of different versions of a statistic was the reason
that I wasn't able to verify Kendall's tau before.
I think there should be a clear theoretical foundation and
interpretation, and not just twisting the tie handling a bit.

For example, Spearman's r: the calculation is based on a shorthand
formula that only works when there are no ties. If there are ties, the
discussion starts about how to handle them. But if you go back to the
definition as the correlation of the rank ordering implied by the data,
then we can just use the standard correlation coefficient on the
rankdata and we don't have to worry about tie handling.
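To make the Spearman point concrete, here is a minimal sketch. The
data are made up for illustration, and I am assuming a recent SciPy in
which spearmanr handles ties through the ranks. It compares the
standard correlation coefficient on the rankdata with the shorthand
formula 1 - 6*sum(d**2) / (n*(n**2 - 1)):

```python
import numpy as np
from scipy import stats

# Made-up sample with ties in both variables (illustration only).
x = np.array([1.0, 2.0, 2.0, 3.0, 5.0, 5.0, 7.0])
y = np.array([2.0, 1.0, 4.0, 4.0, 6.0, 5.0, 9.0])

# rankdata assigns tied observations their average rank by default.
rx = stats.rankdata(x)
ry = stats.rankdata(y)

# Definition: ordinary (Pearson) correlation of the ranks.
rho_from_ranks = np.corrcoef(rx, ry)[0, 1]

# Shorthand formula, which is only exact when there are no ties.
n = len(x)
d = rx - ry
rho_shorthand = 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# Library value, for comparison (assumes a recent SciPy).
rho_scipy = stats.spearmanr(x, y)[0]

print(rho_from_ranks, rho_shorthand, rho_scipy)
```

With these ties the shorthand value differs slightly from the
correlation of the ranks, while the latter should agree with
spearmanr up to floating point.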

Another case:
What's the point of pointbiserialr(x, y)? It's just the
correlation between a binary and a continuous variable.
It has a nice explicit formula to calculate it (almost) by hand. But
using a computer we can just use np.corrcoef
and don't have to worry about special functions.
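As a rough check of the point-biserial case, the sketch below
compares np.corrcoef, stats.pointbiserialr and the explicit by-hand
formula (M1 - M0) / s_n * sqrt(n1 * n0 / n**2). The data and the
variable names are made up for illustration:

```python
import numpy as np
from scipy import stats

# Made-up data: a binary group indicator and a continuous score.
group = np.array([0, 0, 0, 1, 1, 1, 1, 0])
score = np.array([2.1, 3.4, 1.9, 4.8, 5.5, 4.1, 6.0, 3.0])

# Plain correlation coefficient between the two variables.
r_corrcoef = np.corrcoef(group, score)[0, 1]

# The special-purpose function.
r_pointbiserial = stats.pointbiserialr(group, score)[0]

# Explicit formula: difference of the group means, scaled by the
# population standard deviation (ddof=0) and the group proportions.
m1 = score[group == 1].mean()
m0 = score[group == 0].mean()
n1 = int(np.sum(group == 1))
n0 = int(np.sum(group == 0))
n = n1 + n0
r_formula = (m1 - m0) / score.std() * np.sqrt(n1 * n0 / n ** 2)

print(r_corrcoef, r_pointbiserial, r_formula)
```

All three values should agree up to floating point, which is the
sense in which the special function adds nothing beyond np.corrcoef.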

I think some of these formulas are left over from the time before we
had fast computers.

