[SciPy-User] stats.ranksums vs. stats.mannwhitneyu

josef.pktd@gmai... josef.pktd@gmai...
Wed Oct 10 10:18:18 CDT 2012


On Wed, Oct 10, 2012 at 8:59 AM, Nils Kölling <nkoelling@gmail.com> wrote:
> Thank you for your reply, Josef! Is there any reason you are
> calculating the test manually in your code instead of using
> scipy.stats.kruskal?

I also have a trial version for mannwhitneyu:
https://gist.github.com/3866149

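The main reason to use a function-specific permutation test is that
some of the calculations stay the same for every permutation. rankdata
in particular can be slow, and the ranks of the pooled sample do not
change when the group labels are permuted, so they only need to be
computed once.

A generic permutation approach is more flexible, but I expect it to be
slower.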
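
As a rough illustration of reusing calculations across permutations,
here is a minimal sketch (not the code from the gist) of a
function-specific permutation test for the rank-sum statistic. The
function name, its parameters, the two-sided definition via the distance
from the null mean, and the add-one correction are all assumptions made
for this example. The pooled ranks are computed once with rankdata, and
each permutation only shuffles those ranks.

```python
import numpy as np
from scipy.stats import rankdata


def ranksum_permutation_pvalue(x, y, n_perm=10000, seed=0):
    """Hypothetical helper: Monte Carlo permutation p-value for the rank sum of x."""
    x = np.asarray(x)
    y = np.asarray(y)
    n1 = len(x)

    # Rank the pooled sample once (ties get average ranks). Permuting the
    # group labels never changes these ranks, only which group they fall in.
    ranks = rankdata(np.concatenate([x, y]))

    observed = ranks[:n1].sum()
    # Mean of the rank sum of the first group under the null hypothesis.
    expected = n1 * (len(ranks) + 1) / 2.0

    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_perm):
        perm_sum = rng.permutation(ranks)[:n1].sum()
        # Two-sided: count permutations at least as far from the null mean.
        if abs(perm_sum - expected) >= abs(observed - expected) - 1e-12:
            count += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (count + 1) / (n_perm + 1)


if __name__ == "__main__":
    stat, p = ranksum_permutation_pvalue(8 * [0], 3 * [1])
    print(stat, p)
```

A generic version would instead recompute the full test statistic from
the reshuffled raw data on every iteration, which means one rankdata
call per permutation.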

>
> I have written my own version for permutation-based p-values using
> stats.mannwhitneyu now and ran a few trials. Here is what I get for:
>
> a=8*[0]
> b=n*[1]
>
> n = 1  - normal = 0.0133283287808  / permuted = 0.109775608976
> n = 2  - normal = 0.00491580235039  / permuted = 0.0232390704372
> n = 3  - normal = 0.00244136177941  / permuted = 0.00559977600896
> n = 4  - normal = 0.00131365315366  / permuted = 0.00185992560298
> n = 5  - normal = 0.000731481991814  / permuted = 0.000719971201152
> n = 6  - normal = 0.000414875963454  / permuted = 0.000539978400864
> n = 7  - normal = 0.000237996579543  / permuted = 0.00019999200032
> n = 8  - normal = 0.000137586057166  / permuted = 0.000159993600256
> n = 9  - normal = 7.99851933706e-05  / permuted = 7.9996800128e-05
>
> So if we assume that the permuted p-value is the "true" value, it
> seems like one could get away with just using the normal,
> non-permutation based version for n >= 5, since the permuted value
> does not differ much from the normal one anymore. What do you think?

I mainly tried the n1=5, n2=25 case, and I also see only small
differences between the normal-approximation p-values and the
permutation p-values. The difference for kruskal was also small.
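
For the quoted a=8*[0], b=n*[1] trials, the exact permutation p-value
can be worked out by counting, which gives a reference point for the
Monte Carlo values. The two samples are completely separated, so only
one of the comb(8 + n, n) equally likely label assignments is as
extreme as the observed one. The sketch below assumes a one-sided test,
which appears to be consistent with the quoted permuted values (for
example about 1/9 for n = 1), but that is an assumption, and the helper
name is made up for this example.

```python
from math import comb


def exact_one_sided_p(n_a, n_b):
    """Hypothetical helper: exact one-sided permutation p-value when the
    two samples are completely separated (all of a below all of b)."""
    # Exactly one assignment of the n_b "b" labels puts them all on the
    # largest values, out of comb(n_a + n_b, n_b) equally likely assignments.
    return 1.0 / comb(n_a + n_b, n_b)


if __name__ == "__main__":
    for n in range(1, 10):
        print(n, exact_one_sided_p(8, n))
```

With such small p-values, a Monte Carlo estimate from a limited number
of permutations can be noisy, so part of the gap between the normal and
permuted columns may be simulation error rather than a failure of the
approximation.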

One possibility is that the difference might be larger if the data come
from a "very non-normal" distribution, but I haven't tried that yet.

If someone really wants to use hard thresholds like alpha=0.05, then
small differences can change the result of the test, for example in my
generated example:

two-sided p-value from the normal approximation, and from permutations
27.0 0.0514504675812 0.0454

(But I don't think it should make much difference to our conclusions
whether we have 0.051 or 0.045.)

Cheers,

Josef

>
> Cheers
>
> Nils

