[SciPy-dev] Possible Error in Kendall's Tau (scipy.stats.stats.kendalltau)
josef.pktd@gmai...
Tue Mar 17 18:11:41 CDT 2009
On Tue, Mar 17, 2009 at 6:10 PM, <josef.pktd@gmail.com> wrote:
> Hollander, M., and D. A. Wolfe (1999), Nonparametric Statistical Methods,
> is supposed to discuss tie handling for Kendall's tau,
> but I don't have access to it.
>
> Searching some references again, I still only get ambiguous answers
> on whether matching ties should be counted or not.
>
> I guess we can stick with the current implementation if it produces
> the same result as Numerical Recipes.
> And I like it better if a vector has correlation 1 with itself.
>
> But I found a verification for the calculation of the variance and the p-value.
>
> Josef
>
I saw it mentioned somewhere that Kendall's tau is the correlation
coefficient of pairwise ranking indicators. I tried to see if I could
reproduce this, and the version below replicates the results of the
current implementation for the test examples.

So, similar to Spearman and the other correlation statistics, we just
need to construct the right transformation to get a nice correlation
interpretation back.

I think this wouldn't hold if we didn't exclude matching ties from the
counts in the denominator, as the current implementation does.

Josef

import numpy as np
from scipy import stats
from numpy.testing import assert_almost_equal


def kendalltaustat(x, y):
    '''Calculate Kendall's tau-b correlation statistic.

    This is just the uncentered correlation of all pairwise ranking
    indicators.
    '''
    # sign indicators for all ordered pairs (i, j); tied pairs give 0
    ppos1 = np.sign(x[:, np.newaxis] - x).astype(float).ravel()
    ppos2 = np.sign(y[:, np.newaxis] - y).astype(float).ravel()
    # correlation coefficient without mean correction
    tau = np.dot(ppos1, ppos2) / np.sqrt(np.dot(ppos1, ppos1) *
                                         np.dot(ppos2, ppos2))
    return tau


x1a = np.array([0, 1, 3, 3, 4, 5, 5, 7, 8])
x1b = np.array([1, 3, 3, 4, 5, 5, 7, 8, 9])
x1c = np.array([1, 3, 3, 3, 5, 5, 7, 8, 9])
x1 = np.array([1, 1, 2])
x2a = np.array([1, 1, 2, 2])
x2b = np.array([1, 2, 3, 4])

data = [(x1a, x1a),
        (x1a, x1b),
        (x1a, x1c),
        (x1, x1),
        (x2a, x2b),
        (x2b, x2b),
        (x2a, 3 - x2a),
        (x2b, 3 - x2b)]

# fixed examples, with and without ties
for x, y in data:
    t1 = kendalltaustat(x, y)
    ts, ps = stats.kendalltau(x, y)
    print(t1, (ts, ps))
    # compare up to floating-point rounding
    assert_almost_equal(t1, ts, decimal=14)

# random continuous data (no ties expected)
for i in range(10):
    x = np.random.randn(20)
    y = np.random.randn(20)
    t1 = kendalltaustat(x, y)
    ts, ps = stats.kendalltau(x, y)
    print(t1, (ts, ps))
    assert_almost_equal(t1, ts, decimal=14)
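
To illustrate the point about the denominator, here is a minimal sketch
comparing the two conventions on a vector with ties. The helper names
(pairwise_signs, tau_excluding_ties, tau_all_pairs) are made up for this
illustration and are not part of scipy; tau_all_pairs is only my assumption
of what "counting matching ties in the denominator" would look like.

```python
import numpy as np


def pairwise_signs(x):
    # sign of x[i] - x[j] for all ordered pairs; tied pairs give 0
    x = np.asarray(x, dtype=float)
    return np.sign(x[:, np.newaxis] - x).ravel()


def tau_excluding_ties(x, y):
    # uncentered correlation of the sign indicators: tied pairs contribute
    # nothing to the norms, so they drop out of the denominator
    sx, sy = pairwise_signs(x), pairwise_signs(y)
    return np.dot(sx, sy) / np.sqrt(np.dot(sx, sx) * np.dot(sy, sy))


def tau_all_pairs(x, y):
    # hypothetical variant: the denominator counts every ordered pair
    # i != j, whether or not it is tied
    sx, sy = pairwise_signs(x), pairwise_signs(y)
    n = len(x)
    return np.dot(sx, sy) / (n * (n - 1))


x_tied = np.array([1, 1, 2])
x_untied = np.array([1, 2, 3, 4])

print(tau_excluding_ties(x_tied, x_tied))    # vector against itself
print(tau_all_pairs(x_tied, x_tied))         # same data, all pairs counted
print(tau_excluding_ties(x_untied, x_untied))
print(tau_all_pairs(x_untied, x_untied))
```

With the tied vector, only the first version gives correlation 1 of the
vector with itself; the second is no longer a correlation coefficient of
the indicators. Without ties the two agree.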