[SciPy-User] [SciPy-user] Two Sample Kolmogorov-Smirnov Test scipy vs R

josef.pktd@gmai... josef.pktd@gmai...
Wed Dec 19 09:11:29 CST 2012


On Wed, Dec 19, 2012 at 7:16 AM, amundell <andrewhdmundell@gmail.com> wrote:
>
> I am currently creating a statistical app where I am comparing my hypothesis
> test results with the R and Python (scipy) libraries. So far so good with most
> tests. However, I have found a discrepancy between the R and Python results for
> the two-sample Kolmogorov-Smirnov test. Below are the data vectors I have been
> using, formatted as appropriate for both R's ks.test and
> scipy.stats.mstats.ks_twosamp.
>
> sample1=[23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3,
>          14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7,
>          24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3,
>          24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3,
>          1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4,
>          22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1, 39.1,
>          19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5]
>
> sample2=[16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8, 12.2,
>          23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5, 25.8, 22.5,
>          14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2,
>          22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2, 22.5, 25.0, 1,
>          6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2, 26.2, 20.4,
>          23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2, 20.2, 24.7, 22.6,
>          39.1, 26.5, 22.7]
>
> Running the tests:
> R:
> TT = ks.test(sample1, sample2)
> TG = ks.test(sample1, sample2, alternative="greater")
> TL = ks.test(sample1, sample2, alternative="less")
>
> TT Result: D = 0.2204, p-value = 0.04205, alternative hypothesis: two-sided
> TG Result: D^+ = 0.2204, p-value = 0.02102, alternative hypothesis: the CDF of x lies above that of y
> TL Result: D^- = 0.1242, p-value = 0.2933, alternative hypothesis: the CDF of x lies below that of y
>
> Scipy:
>
> TT=scipy.stats.mstats.ks_twosamp(sample1, sample2)
> TU=scipy.stats.mstats.ks_twosamp(sample1, sample2, alternative='greater')
> TL=scipy.stats.mstats.ks_twosamp(sample1, sample2, alternative='less')
>
> TT Result: D = 0.220411392405, p-value = 0.0420492678738
> TU Result: D = 0.124208860759, p-value = 0.293327703926
> TL Result: D = 0.220411392405, p-value = 0.0210248293393
>
> As the results show, the one-tailed upper and lower values seem to be
> reversed between the two libraries. In my app, my results were more consistent
> with R's. Am I missing something obvious here, e.g. with the definitions, or is
> there potentially a bug in the scipy code?
> Any help will be much appreciated. Cheers.

It's not really a bug, since the documentation for mstats.ks_2samp
doesn't specify what is meant by "greater" or "less".

But I think this should be clarified in the documentation and changed.
For stats.kstest I followed the R definition for the one-sided tests, IIRC.
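
For reference, R's one-sided convention can be written down directly from the
empirical cdfs: with alternative="greater" ("the CDF of x lies above that of y")
the statistic is D^+ = max(F_x - F_y), and with alternative="less" it is
D^- = max(F_y - F_x). A minimal sketch with plain numpy on made-up data; the
helper names ecdf_at and ks_one_sided_stats are hypothetical, and this is not
how scipy implements the test:

```python
import numpy as np

def ecdf_at(sample, points):
    """Empirical cdf of `sample` evaluated at `points` (right-continuous)."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, points, side="right") / sample.size

def ks_one_sided_stats(x, y):
    """Return (D+, D-) under the R convention.

    D+ = max(F_x - F_y): large when the cdf of x lies above that of y.
    D- = max(F_y - F_x): large when the cdf of x lies below that of y.
    """
    # The extremes of a difference of step functions occur at data points.
    pooled = np.concatenate([x, y])
    fx = ecdf_at(x, pooled)
    fy = ecdf_at(y, pooled)
    return float(np.max(fx - fy)), float(np.max(fy - fx))

# Toy data: every x is smaller than every y, so F_x lies above F_y.
d_plus, d_minus = ks_one_sided_stats([1, 2, 3], [4, 5, 6])
print(d_plus, d_minus)
```

Passing sample1 and sample2 from the original message as x and y should give
values comparable to R's D^+ and D^-, which makes it easy to see which label a
given library attaches to which direction.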


Aside:

One part that I always find confusing in this is that having a larger
cdf means that the random values are smaller (in a stochastic
dominance sense); see
http://en.wikipedia.org/wiki/Stochastic_dominance#First-order_stochastic_dominance
"A dominating B means that F_A(x) <= F_B(x) for all x, with strict
inequality at some x"
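
To make the direction concrete: shifting a sample toward larger values pushes
its empirical cdf down, not up. A minimal sketch on made-up data (the helper
ecdf and the arrays a, b are illustrative only, not from the thread):

```python
import numpy as np

def ecdf(sample, points):
    """Fraction of `sample` that is <= each value in `points`."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, points, side="right") / sample.size

b = np.arange(10.0)                  # 0, 1, ..., 9
a = b + 2.0                          # same values shifted to be larger
grid = np.linspace(-1.0, 12.0, 53)   # evaluation points in steps of 0.25

F_a = ecdf(a, grid)
F_b = ecdf(b, grid)

# a dominates b: its cdf is never above F_b and is strictly below somewhere
a_dominates_b = bool(np.all(F_a <= F_b) and np.any(F_a < F_b))
print(a_dominates_b)
```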

However, Kolmogorov-Smirnov only looks at the maximum deviation of the
cdfs in either direction.
In your example the empirical cdfs intersect several times:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.graphics.gofplots as smgp

# empirical cdfs of both samples, using plotting positions i/(n+1)
plt.figure()
n1 = len(sample1)
n2 = len(sample2)
plt.step(np.sort(sample1), np.arange(1, n1+1)/(n1+1.), label='sample1')
plt.step(np.sort(sample2), np.arange(1, n2+1)/(n2+1.), label='sample2')
plt.legend()

# two-sample q-q plot; the last point of sample1 is dropped
# because this requires equal length
fig2 = smgp.qqplot_2samples(np.asarray(sample1)[:-1],
                            np.asarray(sample2), line='45')
plt.show()
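
The crossings can also be counted numerically rather than read off the plot.
A minimal sketch on made-up data; the helper name count_ecdf_crossings is
hypothetical, and it uses the usual i/n empirical cdf rather than the i/(n+1)
plotting positions:

```python
import numpy as np

def count_ecdf_crossings(x, y):
    """Count sign changes of F_x - F_y over the pooled data points."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    grid = np.unique(np.concatenate([x, y]))
    diff = (np.searchsorted(x, grid, side="right") / x.size
            - np.searchsorted(y, grid, side="right") / y.size)
    # Ignore points where the two cdfs touch; only strict sign flips count.
    signs = np.sign(diff[~np.isclose(diff, 0.0)])
    return int(np.count_nonzero(signs[1:] != signs[:-1]))

# Toy data whose empirical cdfs alternate in which one is on top.
n_cross = count_ecdf_crossings([1, 4, 5, 8], [2, 3, 6, 7])
print(n_cross)
```

With several crossings, neither sample dominates the other, so both one-sided
statistics can be nonzero at the same time.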

Josef