[SciPy-dev] Possible Error in Kendall's Tau (scipy.stats.stats.kendalltau)

Sturla Molden sturla@molden...
Thu Mar 19 06:42:58 CDT 2009

```On 3/19/2009 2:58 AM, josef.pktd@gmail.com wrote:

> I have a hard time following the double loop and slice indices inside
> the double loop. It would be fast and safe on memory since it runs
> through the double loop only once, but hard to read, debug and
> maintain.

We start with a table of counts. Given a pivot with index (j,i), we can
split the table into 8 fields. When counting pairs only once, concordant
pairs are to the lower right and disconcordant to the lower left. Pairs
with tied-y are to the right and pairs with tied x are below.

import numpy as np
from math import sqrt

import numpy as np
from math import sqrt

def taub(table):

def _table(j,i):
return {
#'above' :       table[:j,i],
'below' :       table[j+1:,i],
#'left' :        table[j,:i],
'right' :       table[j,i+1:],
#'upper-left' :  table[:j,:i],
#'upper-right' : table[:j,i+1:],
'lower-left' :  table[j+1:,:i],
'lower-right' : table[j+1:,i+1:] }

D = table.shape[0]
C = table.shape[1]
P = 0
Q = 0
Tx = 0
Ty = 0
for i in range(C):
for j in range(D):

pivot = table[j,i] # use this as pivot
ct = _table(j,i)   # split remainder into 8 sections

# count concordant pairs
# -- multiply pivot with 'lower-right' and summate

P += np.sum(ct['lower-right'] * pivot)

# count disconcordant pairs
# -- multiply pivot with 'lower-left' and summate

Q += np.sum(ct['lower-left'] * pivot)

# count pairs tied in y
# -- multiply pivot with 'right' and summate

Ty += np.sum(ct['right'] * pivot)

# count pairs tied in x
# -- multiply pivot with 'lower' and summate

Tx += np.sum(ct['below'] * pivot)

return float(P-Q)/(sqrt((P+Q+Tx)*(P+Q+Ty)))

It seems I did a mistake in the Fortran. I counted upper-right as
concordant where I should have counted lower-right (!#"¤#"%...):

subroutine taub(C, D, table, tau)
implicit none
intrinsic :: sqrt, sum
integer*4, intent(in) :: C, D
integer*4, dimension(C,D), intent(in) :: table
real*8, intent(out) :: tau
integer*4 :: i, j, pivot, P, Q, Tx, Ty
P = 0
Q = 0
Tx = 0
Ty = 0
!\$omp  parallel do &
!\$omp& private(i,j,tmp) &
!\$omp& reduction(+:P,Q,Tx,Ty) &
!\$omp& shared(table,C,D)
do i = 1,D
do j = 1,C
pivot = table(j,i)
! count concordant pairs
if ((i .lt. D) .and. (j .lt. C)) then
P = P + sum(table(j+1:C,i+1:D)*pivot)
end if
! count disconcordant pairs
if ((i .gt. 1) .and. (j .lt. C)) then
Q = Q + sum(table(j+1:C,1:i-1)*pivot)
end if
! count pairs tied in y
if (i .lt. D) then
Ty = Ty + sum(table(j,i+1:D)*pivot)
end if
! count pairs tied in x
if (j .lt. C) then
Tx = Tx + sum(table(j+1:C,i)*pivot)
end if
end do
end do
!\$omp end parallel do
tau = real(P-Q)/sqrt(real((P+Q+Tx)*(P+Q+Ty)))
end subroutine

> def kendalltau_fromct(x, y, count):
>     '''return tau-a, tau-b and tau-c from contingency table data
>
>     example for contingency table
>     x = np.array([1,1,1,2,2,2])
>     y = np.array([1,2,3,1,2,3])
>     count = np.array([10, 5, 2, 9, 12, 16])
>     '''

It looks good, but it can be terribly slow. E.g. when I run it with say
an contigency table of 100 categories for X and 100 categoried for Y, it
takes for ever and finally raises a MemoryError. Whereas my version with
loops takes just 2 seconds. On the other hand, if  we go furter up to a
100 x 1000 ct, the looped version takes 90 seconds. Yours exits
immediately with 'ValueError: broadcast dimensions too large.'

An other thing is that the input is a bit difficult. Perhaps we could
have flattened count and calculated x and y inside the function? Then it
would just take a 2D array of counts as input.

def kendalltau_fromct(count):
'''return tau-a, tau-b and tau-c from contingency table data

example for contingency table

count = np.array([[10,  5,  2],
[ 9, 12, 16]])

'''

assert(count.ndim == 2)
ny = count.shape[0]
nx = count.shape[1]
x, y = np.meshgrid(range(nx),range(ny))
x = x.flatten()
y = y.flatten()
count = count.flatten()

catx = np.unique(x)
caty = np.unique(y)
ncatx = len(catx)
ncaty = len(caty)

deltax = np.sign(x[:,np.newaxis] - x)
deltay = np.sign(y[:,np.newaxis] - y)

paircount = count[:,np.newaxis]*count - np.diag(count)
# number of concordant minus number of discordant pairs
netconc = np.sum((deltax*deltay)*paircount)

# calculation for tau-a
tau_a = netconc/(1.*paircount.sum())

# calculation for tau-c
m = min(ncatx,ncaty)
tau_c = netconc /(1.*count.sum()**2)* m/(m-1.0)

#extra calculation for tau_b
# row and column counts of contingency table
countx = np.dot(count,(x[:,np.newaxis]==catx))
county = np.dot(count,(y[:,np.newaxis]==caty))
#total number of pairs
npairs = paircount.sum()
#number of ties
ntiex = np.dot(countx,(countx-1))
ntiey = np.dot(county,(county-1))

denom = 1.0*np.sqrt((npairs - ntiex ) * (npairs - ntiey))
tau_b = netconc / denom

return tau_a, tau_b, tau_c

Anyhow, I think Fortran or Cython is really needed here.

Sturla

```