[SciPy-user] rewriting stats.spearmanr
josef.pktd@gmai...
Fri Jan 2 23:29:14 CST 2009
spearmanr in scipy.stats does not handle ties correctly.
I was looking for a way to fix it, and ended up instead with a complete
rewrite. The main difference from the current version is that it can
return a correlation matrix for several variables at the same time.
This came pretty cheap because, instead of using the old shortcut
formula for Spearman's rho, I just use np.corrcoef on the ranks.
Calculating the correlation matrix takes 3 lines, but as usual,
dimension handling and test scripts take several times more time and
lines than the function itself.
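
To illustrate the idea of ranking and then calling np.corrcoef, here is a
minimal sketch. It is not the attached rewrite: the helper name
spearman_matrix is made up for this example, and it assumes a 2-D input
with no missing values.

```python
import numpy as np
from scipy import stats

def spearman_matrix(data, axis=0):
    """Hypothetical helper: Spearman correlation matrix of a 2-D array.

    With axis=0 each column is a variable, with axis=1 each row is.
    """
    data = np.asarray(data, dtype=float)
    if data.ndim != 2:
        raise ValueError("expected a 2-D array")
    if axis == 1:
        data = data.T
    # rankdata assigns average ranks to ties; rank each column separately.
    ranks = np.apply_along_axis(stats.rankdata, 0, data)
    # The Pearson correlation of the ranks is Spearman's rho.
    return np.corrcoef(ranks, rowvar=False)

x = [1, 2, 2, 3, 5]
y = [2, 1, 4, 4, 6]
rho = spearman_matrix(np.column_stack([x, y]))
```

The off-diagonal element rho[0, 1] is the coefficient for the pair (x, y).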
Results are verified against R (through rpy) and agree to 15 or 16
digits, both for integer variables with ties and for continuous
variables without ties, although R has more options and also offers an
exact test.
I could keep the API completely consistent with the current version,
but I would also like to return the test statistic, and not just the
p-value. This would, however, require returning a 3-tuple instead of a
2-tuple.
The new signature is: spearmanr(a, b=None, axis=0)
Notes are below; the new function and test scripts are in the attachment.
Comments?
Josef
Notes
-----
main changes to existing stats.spearmanr
* correct tie handling
* calculates a correlation matrix instead of only a single correlation
  coefficient, similar to np.corrcoef but using keyword argument
  axis=0 (default)
* also returns the t-statistic (can be dropped for backwards compatibility)
* open question, zero division
>>> stats.spearmanr([1,1,1,1],[2,2,2,2])
(1.0, 0.0)
>>> spearmanr([1,1,1,1],[2,2,2,2])
(-1.#IND, -1.#IND, 0.0)
>>> np.corrcoef([1,1,1,1],[2,2,2,2])
array([[ NaN,  NaN],
       [ NaN,  NaN]])
comparison to stats.mstats.spearmanr
* both have correct tie handling
* mstats.spearmanr
  - ravels if more than 1 variable per array
  - calculates only one correlation coefficient, no correlation matrix
  - uses masked arrays
difference to np.corrcoef
* uses keyword argument axis=0 (default) instead of rowvar=1
* returns a single correlation coefficient for two variables instead of
  a 2 by 2 matrix
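
The two np.corrcoef conventions contrasted above can be seen in a small
sketch. The random data are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(20, 3))   # 20 observations (rows), 3 variables (columns)

# np.corrcoef defaults to rowvar=True, so each ROW is treated as a variable.
by_rows = np.corrcoef(data)                 # shape (20, 20)
# rowvar=False treats each COLUMN as a variable, like axis=0 in the notes.
by_cols = np.corrcoef(data, rowvar=False)   # shape (3, 3)

# For two 1-D inputs np.corrcoef still returns a 2 by 2 matrix;
# the single coefficient is the off-diagonal element.
x, y = data[:, 0], data[:, 1]
r = np.corrcoef(x, y)[0, 1]
```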
comparison to R
* identical correlation matrix if only one array given
* if 2 arrays are given, then R only returns cross-correlation
* p-value is the same as in R with exact=False
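
For reference, a t-statistic and p-value of the kind mentioned in these
notes could be computed roughly as follows. This is a sketch under the
assumption that the usual asymptotic t approximation with n - 2 degrees
of freedom is meant; the helper name spearman_t_test is hypothetical, and
the zero-division cases raised above are not handled.

```python
import numpy as np
from scipy import stats

def spearman_t_test(x, y):
    """Hypothetical helper: Spearman's rho, t-statistic, two-sided p-value."""
    rx = stats.rankdata(x)   # average ranks for ties
    ry = stats.rankdata(y)
    n = len(rx)
    rho = np.corrcoef(rx, ry)[0, 1]
    dof = n - 2
    # t = rho * sqrt((n - 2) / (1 - rho**2)); undefined when |rho| == 1
    # or when an input is constant (rho is NaN in that case).
    t = rho * np.sqrt(dof / (1.0 - rho**2))
    # Two-sided p-value from the t distribution with n - 2 degrees of freedom.
    p = 2.0 * stats.t.sf(abs(t), dof)
    return rho, t, p

x = [1, 2, 2, 3, 5, 7]
y = [2, 1, 4, 4, 6, 3]
rho, t, p = spearman_t_test(x, y)
```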
-------------- next part --------------
Name: spearmanr_rewrite.py
Url: http://projects.scipy.org/pipermail/scipy-user/attachments/20090103/61002027/attachment.pl