[SciPy-user] rewriting stats.spearmanr

josef.pktd@gmai... josef.pktd@gmai...
Fri Jan 2 23:29:14 CST 2009


spearmanr in scipy.stats does not handle ties correctly.

I was looking for a way to fix it, and ended up with a complete
rewrite instead. The main difference from the current version is that
it can return a correlation matrix for several variables at the same
time. This came pretty cheap because, instead of using the old shortcut
formula for Spearman's rho, I just use np.corrcoef on the ranks.
Calculating the correlation matrix takes 3 lines, but as usual,
dimension handling and test scripts take several times more time and
lines than the function itself.
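
To illustrate the idea, here is a minimal sketch (not the attached
implementation) of Spearman's rho computed as np.corrcoef on ranked data.
It assumes scipy.stats.rankdata for average ranks; the function names are
hypothetical and chosen only for this example. The second function is the
textbook shortcut formula, included to show where ties cause trouble.

```python
import numpy as np
from scipy.stats import rankdata


def spearman_matrix_sketch(a):
    """Rank each column (ties get average ranks), then correlate the ranks."""
    a = np.asarray(a, dtype=float)
    # Rank along axis 0, so each column is treated as one variable.
    ranks = np.apply_along_axis(rankdata, 0, a)
    # Pearson correlation of the ranks, columns as variables.
    return np.corrcoef(ranks, rowvar=False)


def spearman_shortcut(x, y):
    """Textbook shortcut 1 - 6*sum(d^2) / (n*(n^2 - 1)); exact only without ties."""
    d = rankdata(x) - rankdata(y)
    n = len(d)
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))


x = [1, 2, 2, 3]  # contains a tie
y = [1, 2, 3, 4]
rho_ranks = spearman_matrix_sketch(np.column_stack([x, y]))[0, 1]
rho_shortcut = spearman_shortcut(x, y)
print(rho_ranks, rho_shortcut)
```

With the tie in x, the two values differ (about 0.949 from the ranks
versus 0.95 from the shortcut), which is the kind of discrepancy the
rewrite is meant to remove.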

Results are verified against R (through rpy) and agree to 15 or 16
digits, both for integer variables with ties and for continuous
variables without ties, although R has more options and can also
compute an exact test.

I could keep the API completely consistent with the current version,
but I would also like to return the test statistic, not just the
p-value. This would, however, require returning a 3-tuple instead of
a 2-tuple.

The new signature is: spearmanr(a, b=None, axis=0)

Notes are below; the new function and test scripts are in the attachment.

Comments?

Josef



    Notes
    -----

    main changes to existing stats.spearmanr
    * correct tie handling
    * calculates correlation matrix instead of only single correlation
      coefficient,
      similar to np.corrcoef but using keyword argument axis=0 (default)
    * also returns the t-statistic (can be dropped for backwards compatibility)
    * open question: zero division
        >>> stats.spearmanr([1,1,1,1],[2,2,2,2])
        (1.0, 0.0)
        >>> spearmanr([1,1,1,1],[2,2,2,2])
        (-1.#IND, -1.#IND, 0.0)
        >>> np.corrcoef([1,1,1,1],[2,2,2,2])
        array([[ NaN,  NaN],
               [ NaN,  NaN]])
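
The zero-division case can be reproduced with a small sketch like the
following. It assumes scipy.stats.rankdata for the ranking step and
silences the floating-point warning with np.errstate; it only
demonstrates the NaN result and does not settle which return value is
preferable.

```python
import numpy as np
from scipy.stats import rankdata

constant_a = [1, 1, 1, 1]
constant_b = [2, 2, 2, 2]

# Ranks of a constant sequence are themselves constant (all tied).
ranks_a = rankdata(constant_a)
ranks_b = rankdata(constant_b)

# Zero variance leads to 0/0 inside corrcoef; suppress the RuntimeWarning.
with np.errstate(invalid="ignore", divide="ignore"):
    corr = np.corrcoef(ranks_a, ranks_b)

all_nan = bool(np.isnan(corr).all())
print(all_nan)
```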

    comparison to stats.mstats.spearmanr
    * both have correct tie handling
    * mstats.spearmanr
      - ravels if more than 1 variable per array
      - calculates only one correlation coefficient, no correlation matrix
      - uses masked arrays

    difference to np.corrcoef
    * using keyword argument axis=0 (default), instead of rowvar=1
    * returns one correlation coefficient for two variables, instead of
      2 by 2 matrix
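
As a rough illustration of these two differences, the sketch below uses
only np.corrcoef. The data values are made up for the example: columns
are treated as variables via rowvar=False (the counterpart of axis=0),
and the single coefficient for two variables is the off-diagonal entry
of the 2 by 2 matrix.

```python
import numpy as np

# Four observations (rows) of two variables (columns); illustrative data.
data = np.array([[1.0, 2.0],
                 [2.0, 1.0],
                 [3.0, 4.0],
                 [4.0, 3.0]])

# np.corrcoef treats rows as variables by default, so columns need
# rowvar=False or an explicit transpose.
by_columns = np.corrcoef(data, rowvar=False)
by_rows_of_transpose = np.corrcoef(data.T)

# For two variables np.corrcoef returns a 2 by 2 matrix; the single
# correlation coefficient is the off-diagonal element.
single_coefficient = by_columns[0, 1]
print(by_columns.shape, single_coefficient)
```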

    comparison to R
    * identical correlation matrix if only one array given
    * if 2 arrays are given, then R only returns cross-correlation
    * p-value is the same as in R with exact=False
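
For reference, a t-statistic and two-sided p-value can be derived from
rho roughly as sketched below. This assumes the usual t approximation
with n - 2 degrees of freedom; the notes above only state that the
p-value matches R with exact=False, so the exact formula used in the
attachment is an assumption here, and the function name is hypothetical.
The sketch also assumes |rho| < 1.

```python
import numpy as np
from scipy import stats


def spearman_t_pvalue_sketch(rho, n):
    """t approximation for a rank correlation rho from n observations."""
    dof = n - 2
    # t = rho * sqrt((n - 2) / (1 - rho^2)), assuming |rho| < 1.
    t = rho * np.sqrt(dof / ((1.0 + rho) * (1.0 - rho)))
    # Two-sided p-value from the t distribution with n - 2 degrees of freedom.
    p = 2.0 * stats.t.sf(np.abs(t), dof)
    return t, p


t_stat, p_value = spearman_t_pvalue_sketch(0.6, 10)
print(t_stat, p_value)
```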
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: spearmanr_rewrite.py
Url: http://projects.scipy.org/pipermail/scipy-user/attachments/20090103/61002027/attachment.pl 

