[SciPy-User] [SciPy-user] Two Sample Kolmogorov-Smirnov Test scipy vs R

josef.pktd@gmai... josef.pktd@gmai...
Thu Dec 20 07:16:48 CST 2012


On Thu, Dec 20, 2012 at 6:50 AM, amundell <andrewhdmundell@gmail.com> wrote:
>
> Hi again Josef, after investigating a few different algorithm approaches
> (most notably in Press, Teukolsky et al., Numerical Recipes) and looking
> into R's source, it does seem that 'greater' and 'less' do refer to the
> maximum deviation (D) of the first sample's (sample1's) cdf 'above' and
> 'below' sample2's cdf. The maximum values I observe and calculate in the
> plot of my data are consistent with the results that SciPy
> (scipy.stats.mstats.ks_twosamp) generates. I have also just confirmed that
> a third software package gives results similar to SciPy's. I therefore
> believe SciPy's results are accurate.

The results are numerically correct, but the definition of 'greater'
and 'less' is still reversed in scipy.stats.mstats.ks_twosamp, isn't it?
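
For reference, here is a minimal sketch of the convention R's ks.test
documents, computed directly with numpy (the function name is made up here,
and this is not scipy's implementation): D+ is the maximum of
ECDF(x) - ECDF(y), i.e. the CDF of x lying above that of y, and D- is the
reverse.

```python
import numpy as np

def ks_2samp_onesided(x, y):
    """One-sided two-sample KS statistics, R's ks.test convention:
    D+ = max(ECDF_x - ECDF_y)  ('greater': CDF of x lies above that of y)
    D- = max(ECDF_y - ECDF_x)  ('less':    CDF of x lies below that of y)
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    pooled = np.concatenate([x, y])
    # empirical CDFs of both samples, evaluated at every pooled point
    cdf_x = np.searchsorted(x, pooled, side='right') / x.size
    cdf_y = np.searchsorted(y, pooled, side='right') / y.size
    return np.max(cdf_x - cdf_y), np.max(cdf_y - cdf_x)

# toy check: x is stochastically smaller, so its ECDF lies above y's
d_plus, d_minus = ks_2samp_onesided([1, 2, 3], [2, 3, 4])
print(d_plus, d_minus)  # 1/3 and 0.0
```

Running this on Andrew's sample1/sample2 should reproduce the two
deviations reported below; which label each one gets is exactly the point
in dispute.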

BTW, there is a tie correction in the mstats version, and I don't
know whether the test statistic then has the same asymptotic
distribution.

Josef


>
> Andrew
>
>
> amundell wrote:
>>
>> Hi Josef,
>>
>> Thanks for your quick response and the information. It does seem a little
>> confusing, as R uses these parameter names as well, but I guess the
>> documentation can clarify this. Incidentally, I did make some comparison
>> tests with scipy.stats.kstest, with SciPy's results appearing more
>> accurate, as confirmed by a third statistical software package. In the
>> two-sample test, R and the third package matched closely in their tail
>> data. I am quite new to working with this type of test and am also a
>> little confused by the meanings of "greater" and "less" in this scenario.
>>
>> I am going to make some further investigations into the test theory as
>> well as the algorithms being used in R and other packages and see if I can
>> come up with an answer. Will keep you posted.
>>
>> Thanks again,
>>
>> Andrew
>>
>>
>>
>> josef.pktd wrote:
>>>
>>> On Wed, Dec 19, 2012 at 7:16 AM, amundell <andrewhdmundell@gmail.com>
>>> wrote:
>>>>
>>>> I am currently creating a statistical app in which I am comparing my
>>>> hypothesis test results with the R and Python (SciPy) libraries. So far
>>>> so good with most tests. However, I have found a discrepancy between
>>>> the R and Python results for the two-sample Kolmogorov-Smirnov test.
>>>> Below are the data vectors I have been using, formatted for both R's
>>>> ks.test and scipy.stats.mstats.ks_twosamp.
>>>>
>>>> sample1 = [23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7,
>>>>            16.3, 20.3, 14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0,
>>>>            34.6, 40.9, 30.7, 24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7,
>>>>            19.3, 46.9, 23.3, 21.8, 33.3, 24.9, 24.4, 1, 19.8, 17.2,
>>>>            21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3, 1, 21.3, 18.6,
>>>>            26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4, 22.6,
>>>>            1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1,
>>>>            39.1, 19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5]
>>>>
>>>> sample2 = [16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6,
>>>>            10.8, 12.2, 23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3,
>>>>            18.7, 27.5, 25.8, 22.5, 14.2, 21.7, 1, 31.2, 13.8, 29.7,
>>>>            23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2, 22.1, 26.7, 22.7,
>>>>            1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2, 22.5, 25.0, 1, 6.6,
>>>>            23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2, 26.2,
>>>>            20.4, 23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6,
>>>>            23.2, 20.2, 24.7, 22.6, 39.1, 26.5, 22.7]
>>>>
>>>> Running the tests:
>>>> R:
>>>> TT = ks.test(sample1, sample2)
>>>> TG = ks.test(sample1, sample2, alternative="greater")
>>>> TL = ks.test(sample1, sample2, alternative="less")
>>>>
>>>> TT Result: D = 0.2204, p-value = 0.04205
>>>>            alternative hypothesis: two-sided
>>>> TG Result: D^+ = 0.2204, p-value = 0.02102
>>>>            alternative hypothesis: the CDF of x lies above that of y
>>>> TL Result: D^- = 0.1242, p-value = 0.2933
>>>>            alternative hypothesis: the CDF of x lies below that of y
>>>>
>>>> Scipy:
>>>>
>>>> TT=scipy.stats.mstats.ks_twosamp(sample1, sample2)
>>>> TU=scipy.stats.mstats.ks_twosamp(sample1, sample2,
>>>> alternative='greater')
>>>> TL=scipy.stats.mstats.ks_twosamp(sample1, sample2, alternative='less')
>>>>
>>>> TT Result: D = 0.220411392405, p-value = 0.0420492678738
>>>> TU Result: D = 0.124208860759, p-value = 0.293327703926
>>>> TL Result: D = 0.220411392405, p-value = 0.0210248293393
>>>>
>>>> As can be seen from the results, the one-tailed upper and lower
>>>> values appear to be reversed. In my app, my results were more
>>>> consistent with R's. Am I missing something obvious here, e.g. with
>>>> the definitions, or is there potentially a bug in the SciPy code?
>>>> Any help will be much appreciated. Cheers.
>>>
>>> It's not really a bug, since the documentation for mstats.ks_twosamp
>>> doesn't specify what is meant by greater or less.
>>>
>>> But I think this should be clarified in the documentation, and the
>>> behavior changed. For stats.kstest I followed the R definition for
>>> the one-sided tests, IIRC.
>>>
>>>
>>> Aside:
>>>
>>> One part that I always find confusing here is that having a larger
>>> cdf means that the random values are smaller (in a stochastic
>>> dominance sense).
>>> http://en.wikipedia.org/wiki/Stochastic_dominance#First-order_stochastic_dominance
>>> "A dominating B means that F_A(x) <= F_B(x) for all x, with strict
>>> inequality at some x"
>>>
>>> However Kolmogorov-Smirnov only looks at the maximum deviation of the
>>> cdfs in either direction.
>>> In your example we have several intersections
>>>
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>>
>>> plt.figure()
>>> n1 = len(sample1)
>>> n2 = len(sample2)
>>> plt.step(np.sort(sample1), np.arange(1, n1+1)/(n1+1.), label='sample1')
>>> plt.step(np.sort(sample2), np.arange(1, n2+1)/(n2+1.), label='sample2')
>>> plt.legend()
>>>
>>> import statsmodels.graphics.gofplots as smgp
>>> fig2 = smgp.qqplot_2samples(np.asarray(sample1)[:-1],
>>>                             np.asarray(sample2), line='45')  # requires equal lengths
>>> plt.show()
>>>
>>> Josef
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://old.nabble.com/Two-Sample-Kolmogorov-Smirnov-Test-scipy-vs-R-tp34814758p34814758.html
>>>> Sent from the Scipy-User mailing list archive at Nabble.com.
>>>>
>>>> _______________________________________________
>>>> SciPy-User mailing list
>>>> SciPy-User@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>>>
>>
> --
> View this message in context: http://old.nabble.com/Two-Sample-Kolmogorov-Smirnov-Test-scipy-vs-R-tp34814758p34819082.html
> Sent from the Scipy-User mailing list archive at Nabble.com.
>

