[SciPy-User] [SciPy-user] Two Sample Kolmogorov-Smirnov Test scipy vs R

amundell andrewhdmundell@gmail....
Wed Dec 19 12:06:39 CST 2012


Hi Josef,

Thanks for your quick response and the information. It is a little
confusing, since R uses the same parameter names, but I guess the
documentation can clarify this. Incidentally, I did run some comparison
tests with scipy.stats.kstest, and the SciPy results appeared more accurate,
as confirmed by a third statistical software package. In the two-sample test,
R and the third package matched closely on the tail results. I am quite new
to working with this type of test and am also a little confused by the
meanings of "greater" and "less" in this scenario.
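
A minimal sketch of the two one-sided statistics, assuming the convention
in R's output quoted below ("greater" means the CDF of x lies above that of
y). The helper names ecdf and ks_one_sided and the toy data are made up for
illustration; this is not the code used by R or SciPy.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F(t) = fraction of sample values <= t."""
    data = sorted(sample)
    n = len(data)
    return lambda t: bisect_right(data, t) / n


def ks_one_sided(x, y):
    """Return (D_plus, D_minus) for two samples.

    D_plus  = max over t of F_x(t) - F_y(t)   (R: alternative="greater")
    D_minus = max over t of F_y(t) - F_x(t)   (R: alternative="less")
    """
    fx, fy = ecdf(x), ecdf(y)
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them on the pooled set of observations.
    grid = sorted(set(x) | set(y))
    diffs = [fx(t) - fy(t) for t in grid]
    d_plus = max(0.0, max(diffs))
    d_minus = max(0.0, -min(diffs))
    return d_plus, d_minus


# Toy data (not from this thread): x is y shifted down by 2, so x tends
# to be smaller and its empirical CDF lies above that of y.
y = [3.0, 4.0, 5.0, 6.0, 7.0]
x = [v - 2.0 for v in y]
d_plus, d_minus = ks_one_sided(x, y)
print(d_plus, d_minus)
```

Here x holds the smaller values, yet it is D_plus (the "greater"
statistic) that is non-zero, because "greater" refers to the CDF and not to
the values themselves.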

I am going to look further into the test theory, as well as the algorithms
used in R and other packages, and see if I can come up with an answer. I
will keep you posted.

Thanks again,

Andrew



josef.pktd wrote:
> 
> On Wed, Dec 19, 2012 at 7:16 AM, amundell <andrewhdmundell@gmail.com>
> wrote:
>>
>> I am currently creating a statistical app where I am comparing my hypothesis
>> test results with the R and Python (scipy) libraries. So far so good with most
>> tests. However, I have found a discrepancy between the R and Python results for
>> the two-sample Kolmogorov-Smirnov test. Below are the data vectors I have been
>> using, formatted for both R's ks.test and
>> scipy.stats.mstats.ks_twosamp.
>>
>> sample1=[23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3,
>>          14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7,
>>          24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3,
>>          24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3,
>>          1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4,
>>          22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1, 39.1,
>>          19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5]
>>
>> sample2=[16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8, 12.2,
>>          23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5, 25.8, 22.5,
>>          14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2,
>>          22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2, 22.5, 25.0, 1,
>>          6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2, 26.2, 20.4,
>>          23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2, 20.2, 24.7, 22.6,
>>          39.1, 26.5, 22.7]
>>
>> Running the tests:
>> R:
>> TT = ks.test(sample1, sample2)
>> TG = ks.test(sample1, sample2, alternative="greater")
>> TL = ks.test(sample1, sample2, alternative="less")
>>
>> TT Result: D = 0.2204, p-value = 0.04205   alternative hypothesis: two-sided
>> TG Result: D^+ = 0.2204, p-value = 0.02102  alternative hypothesis: the CDF of x lies above that of y
>> TL Result: D^- = 0.1242, p-value = 0.2933   alternative hypothesis: the CDF of x lies below that of y
>>
>> Scipy:
>>
>> TT=scipy.stats.mstats.ks_twosamp(sample1, sample2)
>> TU=scipy.stats.mstats.ks_twosamp(sample1, sample2, alternative='greater')
>> TL=scipy.stats.mstats.ks_twosamp(sample1, sample2, alternative='less')
>>
>> TT Result: D= 0.220411392405, p-value= 0.0420492678738
>> TU Result: D= 0.124208860759, p-value= 0.293327703926
>> TL Result: D= 0.220411392405, p-value= 0.0210248293393
>>
>> So, as can be seen from the results, the one-tailed upper and lower values
>> seem to be reversed. In my app, my results were more consistent with R's.
>> Am I missing something obvious here, e.g. in the definitions, or is there
>> potentially a bug in the scipy code?
>> Any help will be much appreciated. Cheers.
> 
> It's not really a bug, since the documentation for mstats.ks_2samp
> doesn't specify what is meant by greater or less.
> 
> But I think this should be clarified in the documentation and changed.
> For stats.kstest I followed the R definition for the one-sided tests, IIRC
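
The figures quoted above are consistent with the two packages differing
only in which label goes with which statistic. A rough check, assuming the
standard large-sample approximation for the one-sided p-value,
exp(-2 * n1*n2/(n1+n2) * D^2); whether R and scipy use exactly this
expression internally is an assumption here, and the function name
one_sided_p is made up for illustration:

```python
import math


def one_sided_p(d, n1, n2):
    """Asymptotic one-sided p-value: exp(-2 * n1*n2/(n1+n2) * d**2)."""
    en = n1 * n2 / (n1 + n2)
    return math.exp(-2.0 * en * d * d)


# Lengths of sample1 and sample2 as posted in the original question.
n1, n2 = 80, 79

# The two one-sided statistics reported by scipy (and, rounded, by R).
p_large = one_sided_p(0.220411392405, n1, n2)
p_small = one_sided_p(0.124208860759, n1, n2)
print(p_large, p_small)
```

Both values agree with the reported p-values to the digits R prints
(0.02102 and 0.2933), so the statistic and p-value pairs are the same in
both packages and only the "greater"/"less" labels are swapped.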
> 
> 
> Aside:
> 
> One part that I always find confusing in this is that having a larger
> cdf means that the random values are smaller (in a stochastic
> dominance sense)
> http://en.wikipedia.org/wiki/Stochastic_dominance#First-order_stochastic_dominance
> "A dominating B means that F_A(x) <= F_B(x) for all x, with strict
> inequality at some x"
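
A small sketch of that definition applied to empirical CDFs, using toy data
that is not from this thread. The function names ecdf_at and dominates are
made up for illustration:

```python
from bisect import bisect_right


def ecdf_at(sample, t):
    """Empirical CDF of `sample` evaluated at t."""
    data = sorted(sample)
    return bisect_right(data, t) / len(data)


def dominates(a, b):
    """True if a first-order dominates b on the empirical CDFs:
    F_a(t) <= F_b(t) at every observed value, strictly at some value."""
    grid = sorted(set(a) | set(b))
    diffs = [ecdf_at(a, t) - ecdf_at(b, t) for t in grid]
    return all(d <= 0 for d in diffs) and any(d < 0 for d in diffs)


b = [1.0, 2.0, 3.0, 4.0]
a = [v + 1.5 for v in b]  # a is b shifted up, so a has the larger values

a_over_b = dominates(a, b)
b_over_a = dominates(b, a)
print(a_over_b, b_over_a)
```

The sample with the larger values (a) is the dominating one, and it is the
one with the smaller CDF everywhere.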
> 
> However Kolmogorov-Smirnov only looks at the maximum deviation of the
> cdfs in either direction.
> In your example we have several intersections
> 
> import numpy as np
> import matplotlib.pyplot as plt
> import statsmodels.graphics.gofplots as smgp
> 
> # empirical CDFs of both samples as step plots
> plt.figure()
> n1 = len(sample1)
> n2 = len(sample2)
> plt.step(np.sort(sample1), np.arange(1, n1+1)/(n1+1.), label='sample1')
> plt.step(np.sort(sample2), np.arange(1, n2+1)/(n2+1.), label='sample2')
> plt.legend()
> 
> # two-sample Q-Q plot; requires equal length, so drop the last value of sample1
> fig2 = smgp.qqplot_2samples(np.asarray(sample1)[:-1],
>                             np.asarray(sample2), line='45')
> plt.show()
> 
> Josef
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://old.nabble.com/Two-Sample-Kolmogorov-Smirnov-Test-scipy-vs-R-tp34814758p34814758.html
>> Sent from the Scipy-User mailing list archive at Nabble.com.
>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
-- 
View this message in context: http://old.nabble.com/Two-Sample-Kolmogorov-Smirnov-Test-scipy-vs-R-tp34814758p34816338.html
Sent from the Scipy-User mailing list archive at Nabble.com.


