[SciPy-User] [SciPy-user] Two Sample Kolmogorov-Smirnov Test scipy vs R

amundell andrewhdmundell@gmail....
Thu Dec 20 05:50:38 CST 2012


Hi again Josef, after investigating a few different alogrithm approaches
(most notably in Press, Teukolsky et al. Numerical Recipes) and looking into
R's source), it does seem that 'greater' and 'less' does refer to maximum
(D) deviation of the first sample's (sample1's) cdf 'above' and 'below' the
sample2's cdf. By observing and calculating the maximum values in the plot
of my data, they are consistent with the results that SciPy
(scipy.stats.mstats.ks_twosamp) generate. I have just confirmed I am
generating similar results in a third software package to SciPy. I therefore
believe SciPy's results are accurate.

Andrew


amundell wrote:
> 
> Hi Josef,
> 
> Thanks for your quick response and information. It does seem a little
> confusing as R uses these parameters names as well, but I guess
> documentations can clarify this. Incidentally, I did make some comparison
> tests with scipy.stat.kstest, with SciPy results appearing more accurate
> as confirmed by a third statistical software package. In the two sample
> test R and the third software matched closely with their tail data. I am
> quite new to working with this type of test and also a little confused
> with the meanings of "greater" and "less" in this scenario. 
> 
> I am going to make some further investigations into the test theory as
> well as the algorithms being used in R and other packages and see if I can
> come up with an answer. Will keep you posted.
> 
> Thanks again,
> 
> Andrew
> 
> 
> 
> josef.pktd wrote:
>> 
>> On Wed, Dec 19, 2012 at 7:16 AM, amundell <andrewhdmundell@gmail.com>
>> wrote:
>>>
>>> I am currently creating a statistical app where I am comparing my
>>> hypothesis
>>> test results with R and Python (scipy) libraries. So far so good with
>>> most
>>> test. However I have found a discrepancy with the R and Python results
>>> for
>>> the Two-Sample Kolmogorov-Smirnov Tests. Below are data vectors I have
>>> been
>>> using obviously formatted for both R ks.test and
>>> scipy.stat.msstats.ks_twosamp methods.
>>>
>>> sample1=[23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3,
>>> 20.3,
>>>              14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9,
>>> 30.7,
>>>              24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3,
>>> 21.8,
>>> 33.3,
>>>              24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0,
>>> 29.8,
>>> 33.3,
>>>              1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3,
>>> 39.3,
>>> 21.4,
>>>              22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7,
>>> 21.1, 39.1,
>>>              19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5]
>>>
>>> sample2=[16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6,
>>> 10.8,
>>> 12.2,
>>>              23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5,
>>> 25.8, 22.5,
>>>              14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4,
>>> 21.7,
>>> 24.4, 13.2,
>>>              22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2,
>>> 22.5,
>>> 25.0, 1,
>>>              6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2,
>>> 26.2, 20.4,
>>>              23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2,
>>> 20.2,
>>> 24.7, 22.6,
>>>             39.1, 26.5, 22.7]
>>>
>>> Running the tests:
>>> R:
>>> TT = ks.test(sample1, sample2)
>>> TG = ks.test(sample1, sample2, alternative="greater")
>>> TL = ks.test(sample1, sample2, alternative="less")
>>>
>>> TT Result: D = 0.2204, p-value = 0.04205   alternative hypothesis:
>>> two-sided
>>> TG Result: D^+ = 0.2204, p-value = 0.02102  alternative hypothesis: the
>>> CDF
>>> of x lies above that of y
>>> TL Result: D^- = 0.1242, p-value = 0.2933  alternative hypothesis: the
>>> CDF
>>> of x lies below that of y
>>>
>>> Scipy:
>>>
>>> TT=scipy.stats.mstats.ks_twosamp(sample1, sample2)
>>> TU=scipy.stats.mstats.ks_twosamp(sample1, sample2,
>>> alternative='greater')
>>> TL=scipy.stats.mstats.ks_twosamp(sample1, sample2, alternative='less')
>>>
>>> TT Result: D= 0.220411392405, p-value= 0.0420492678738
>>> TU Result: D= 0.124208860759 p-value: 0.293327703926
>>> TL Result: D=: 0.220411392405, p-value: 0.0210248293393
>>>
>>> So as it can be seen from the results the one tailed upper and lower
>>> values
>>> seemed to be reversed. In my app my results were more consistent with
>>> R's.
>>> Am I missing something obvious here i.e. with definitions? or is there
>>> potentially a bug in the scipy code?
>>> Any help will be much appreciated. Cheers.
>> 
>> It's not really a bug, since the documentation for mstats.ks_2samp
>> doesn't specify what is meant by greater or less.
>> 
>> But I think this should be clarified in the documentation and changed.
>> For stats.kstest I followed the R definition for the one-sided tests,
>> IIRC
>> 
>> 
>> Aside:
>> 
>> One part that I always find confusing in this is that having a larger
>> cdf means that the random values are smaller (in a stochastic
>> dominance sense)
>> http://en.wikipedia.org/wiki/Stochastic_dominance#First-order_stochastic_dominance
>> "A dominating B means that F_A(x) <= F_B(x) for all x, with strict
>> inequality at some x"
>> 
>> However Kolmogorov-Smirnov only looks at the maximum deviation of the
>> cdfs in either direction.
>> In your example we have several intersections
>> 
>> import matplotlib.pyplot as plt
>> plt.figure()
>> n1 = len(sample1)
>> n2 = len(sample2)
>> plt.step(np.sort(sample1), np.arange(1, n1+1)/(n1+1.), label='sample1')
>> plt.step(np.sort(sample2), np.arange(1, n2+1)/(n2+1.), label='sample2')
>> plt.legend()
>> 
>> import statsmodels.graphics.gofplots as smgp
>> fig2 = smgp.qqplot_2samples(np.asarray(sample1)[:-1],
>> np.asarray(sample2), line='45') #requires equal length
>> plt.show()
>> 
>> Josef
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Two-Sample-Kolmogorov-Smirnov-Test-scipy-vs-R-tp34814758p34814758.html
>>> Sent from the Scipy-User mailing list archive at Nabble.com.
>>>
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>> 
>> 
> 
-- 
View this message in context: http://old.nabble.com/Two-Sample-Kolmogorov-Smirnov-Test-scipy-vs-R-tp34814758p34819082.html
Sent from the Scipy-User mailing list archive at Nabble.com.



More information about the SciPy-User mailing list