[SciPy-dev] Possible Error in Kendall's Tau (scipy.stats.stats.kendalltau)

josef.pktd@gmai... josef.pktd@gmai...
Wed Mar 18 12:16:59 CDT 2009


On Wed, Mar 18, 2009 at 12:30 PM, Sturla Molden <sturla@molden.no> wrote:
> On 3/18/2009 4:55 PM, Sturla Molden wrote:
>
>> The idea is that Kendall's tau works on an ordinal scale, not a rank scale
>> as Spearman's r does. You can use as many categories for X and Y as you like,
>> but the categories have to be ordered. You thus get a table of counts.
>> If you for example use two categories (small or big) in X and four
>> categories (tiny, small, big, huge) in Y, the table is 2 x 4. If you go
>> all the way up to rank scale, you get a very sparse table with a lot of 0
>> counts. With few categories, ties will be quite common, and that is the
>> justification for tau-b instead of gamma.

I got confused about what the statement "requires ordinal scale" means.

I think the clearer statement would be: it requires at least an ordinal scale;
the categories have to be ordered, not unordered as in "male", "female".
Kendall's tau uses only the ordinal information even if the variable is metric,
like a continuous variable.

It hadn't caught my attention before that Spearman's r uses a rank
scale and not just an (ordinal) ordering.
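
To make that distinction concrete, something like the following made-up
example (just a sketch) shows it: both statistics only use the ordering,
but with only a few ordered categories ties are common, which is where
the tau-b tie correction matters.

import numpy as np
from scipy import stats

# made-up data: X has 2 ordered categories (small, big), Y has 4
# (tiny, small, big, huge), coded as integers so the order is explicit
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([0, 1, 1, 0, 2, 3, 3, 2])

# kendalltau compares pairs for concordance/discordance and corrects
# for the many ties (tau-b); spearmanr correlates the numeric ranks
print(stats.kendalltau(x, y))
print(stats.spearmanr(x, y))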

>
> One very important aspect of this is that it can reduce the
> computational burden substantially. If you e.g. know that 100 categories
> is sufficient resolution, you get a 100 x 100 contingency table. tau-b
> can be computed directly from the table. So for large data sets, this
> avoids the O(N**2) complexity of tau. The complexity of tau-b becomes
> O(N) + O(C*D), with C and D the number of categories in X and Y.
>
> So having a contingency-table version of tau-b would be very useful.

You could still have C*D > N**2 if the table is sparse and you
haven't deleted the empty rows and columns.
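
For what it's worth, a table-based tau-b could look roughly like the
sketch below (my own illustration, not existing scipy code). The double
loop costs O((C*D)**2) for clarity; 2-D cumulative sums would bring it
down to the O(C*D) you mention.

import numpy as np
from scipy import stats

def kendall_tau_b_from_table(table):
    """Kendall's tau-b from a C x D contingency table of counts.

    Rows and columns must correspond to the ordered categories of X and Y.
    """
    t = np.asarray(table, dtype=float)
    C, D = t.shape
    nc = nd = 0.0
    for i in range(C):
        for j in range(D):
            # cells strictly below and to the right are concordant with (i, j)
            nc += t[i, j] * t[i + 1:, j + 1:].sum()
            # cells strictly below and to the left are discordant with (i, j)
            nd += t[i, j] * t[i + 1:, :j].sum()
    n = t.sum()
    n0 = n * (n - 1) / 2.0
    row, col = t.sum(axis=1), t.sum(axis=0)
    n1 = (row * (row - 1) / 2.0).sum()   # pairs tied on X
    n2 = (col * (col - 1) / 2.0).sum()   # pairs tied on Y
    return (nc - nd) / np.sqrt((n0 - n1) * (n0 - n2))

# quick check against kendalltau on the expanded raw observations
table = np.array([[10, 5, 2, 0],
                  [1, 4, 8, 12]])
rows, cols = np.indices(table.shape)
x = np.repeat(rows.ravel(), table.ravel())
y = np.repeat(cols.ravel(), table.ravel())
print(kendall_tau_b_from_table(table), stats.kendalltau(x, y)[0])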

If you have 100 categories for a variable, do you still have to treat
it as an ordinal variable? I would expect that statistics for
continuous variables would produce almost the same results.
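
A quick way to check that expectation would be something like this
(made-up data, purely illustrative): coarsen a continuous variable into
100 ordered bins and compare the statistics on the raw and the binned
versions.

import numpy as np
from scipy import stats

np.random.seed(0)
x = np.random.randn(1000)
y = x + np.random.randn(1000)

# coarsen to 100 ordered categories via the empirical percentiles
xb = np.digitize(x, np.percentile(x, np.linspace(0, 100, 101)[1:-1]))
yb = np.digitize(y, np.percentile(y, np.linspace(0, 100, 101)[1:-1]))

print(stats.kendalltau(x, y)[0], stats.kendalltau(xb, yb)[0])
print(stats.pearsonr(x, y)[0], stats.pearsonr(xb, yb)[0])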

I think there should be some general tools or tricks for working with
contingency tables, so that there is a common pattern for them.
But I never used them much, so it would take me quite some time to
figure out how to do this efficiently. I'm more used to continuous
variables and maybe a few dummy variables.

I was looking at categorical variables for regression and for ANOVA,
and there it is a similar story for the size of the matrices.
If I create a dummy variable for each category combination, then in
your case I would have a matrix of dummy variables of size
(number of observations) * 100 * 100. If the category variables are race
and sex and age group, then the dimension would be much smaller.

In the case of a small number of categories, everything can be
written using simple linear algebra (broadcasting and dot products all
over), which is very fast, but it would require more memory if the
number of categories is really large.
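
For the record, a rough sketch of what I mean by the dummy variables and
dot products (race and sex as the hypothetical example from above):

import numpy as np

np.random.seed(0)
nobs = 20
race = np.random.randint(0, 3, size=nobs)   # 3 hypothetical categories
sex = np.random.randint(0, 2, size=nobs)    # 2 hypothetical categories

# one indicator column per (race, sex) combination:
# design matrix of shape (nobs, 3 * 2)
combo = race * 2 + sex
dummies = (combo[:, None] == np.arange(3 * 2)[None, :]).astype(float)

# with the dummies, group (cell) means are just dot products
y = np.random.randn(nobs)
counts = dummies.sum(axis=0)
cell_means = np.dot(dummies.T, y) / np.where(counts > 0, counts, 1)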

Given that there are so many different application cases for
statistics, choosing an implementation that satisfies most of them
looks pretty difficult, and it requires feedback, contributions and
time to actually do it.

Josef

