[SciPy-User] ks_2samp is not giving the same results as ks.test in R

josef.pktd@gmai... josef.pktd@gmai...
Thu Nov 1 22:27:01 CDT 2012


On Thu, Nov 1, 2012 at 9:42 PM,  <josef.pktd@gmail.com> wrote:
> On Thu, Nov 1, 2012 at 9:41 PM,  <josef.pktd@gmail.com> wrote:
>> On Thu, Nov 1, 2012 at 9:14 PM,  <josef.pktd@gmail.com> wrote:
>>> On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <pengyu.ut@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> The ks_2samp function does not give the same answer as ks.test in R.
>>>> Does anybody know why they are different? Is ks_2samp compute
>>>> something different?
>>>>
>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R
>>>>> ks.test(1:5, 11:15)
>>>>
>>>>         Two-sample Kolmogorov-Smirnov test
>>>>
>>>> data:  1:5 and 11:15
>>>> D = 1, p-value = 0.007937
>>>> alternative hypothesis: two-sided
>>>>
>>>>> ks.test(1:5, 11:15, alternative='less')
>>>>
>>>>         Two-sample Kolmogorov-Smirnov test
>>>>
>>>> data:  1:5 and 11:15
>>>> D^- = 0, p-value = 1
>>>> alternative hypothesis: the CDF of x lies below that of y
>>>>
>>>>> ks.test(1:5, 11:15, alternative='greater')
>>>>
>>>>         Two-sample Kolmogorov-Smirnov test
>>>>
>>>> data:  1:5 and 11:15
>>>> D^+ = 1, p-value = 0.006738
>>>> alternative hypothesis: the CDF of x lies above that of y
>>>>
>>>>>
>>>>>
>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py
>>>> (1.0, 0.0037813540593701006)
>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py
>>>> #!/usr/bin/env python
>>>>
>>>> from scipy.stats import ks_2samp
>>>> print ks_2samp([1,2,3,4,5], [11,12,13,14,15])
>>>
>>> R uses by default an "exact" distribution for small samples if there
>>> are no ties.
>>> If there are ties or with a large sample, R uses the asymptotic distribution.
>>>
>>> If I read the function correctly, then scipy.stats is using a small
>>> sample approximation by Stephens. (But I would have to look up the
>>> formula to verify this.)
>>
>> http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test
>> has the weighted sample size: en = np.sqrt(n1*n2/float(n1+n2))
>> the small sample weighting ((en+0.12+0.11/en)*d) is the same as in
>> Stephens (1970, 1985?) for the one sample test.
>> I don't have a reference for the two sample approximation right now.
>>
>> (another bit of random information)
>> tables are often only available for 0.01 to 0.25 and approximations
> (hit send too fast)  0.001 to 0.25
>> are targeted on that range and might not be as accurate outside of it
>>
>> Josef
>>
>>
>>>
>>> In the example below with a bit larger sample and no ties, our
>>> approximation is closer to R's "exact" pvalue than the asymptotic
>>> distribution if exact=FALSE.
>>>
>>>>  ks.test(1:25, (10:30)-0.5, exact=FALSE)
>>>
>>>         Two-sample Kolmogorov-Smirnov test
>>>
>>> data:  1:25 and (10:30) - 0.5
>>> D = 0.36, p-value = 0.1038
>>> alternative hypothesis: two-sided
>>>
>>>>  ks.test(1:25, (10:30)-0.5, exact=TRUE)
>>>
>>>         Two-sample Kolmogorov-Smirnov test
>>>
>>> data:  1:25 and (10:30) - 0.5
>>> D = 0.36, p-value = 0.07608
>>> alternative hypothesis: two-sided
>>>
>>>
>>>>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5)
>>> (0.35999999999999999, 0.078993426961291274)
>>>
>>>
>>> For the 1 sample kstest I used (when I rewrote stats.kstest) an
>>> approximation that is closer to the exact distribution than the
>>> asymptotic distribution, but it's also not exact.
>>>
>>> It would be good to have better small sample approximations or exact
>>> distributions, but I worked on this in scipy.stats when I barely had
>>> any idea about goodness-of-fit tests.
>>> Also, ks_2samp never got the enhancement for one-sided alternatives.
>>> (In statsmodels I have been working so far only on one sample tests,
>>> but not on two-sample tests.)
>>>
>>> (I don't remember if there is a minimum size recommendation, but the
>>> examples I usually checked were larger.)
>>
>> matlab help: http://www.mathworks.com/help/stats/kstest2.html
>> "The asymptotic p value becomes very accurate for large sample sizes,
>> and is believed to be reasonably accurate for sample sizes n1 and n2
>> such that (n1*n2)/(n1 + n2) >= 4."
>>>
>>>
>>> since it's a community project: Pull Request are welcome

(since I haven't spammed the mailing list in a while, and it's fast)

I think, except in very small sample, ks_2samp looks ok.

(However, independent of our p-values, the power of Kolmogorov-Smirnov
in small samples is awful.
>>> stats.ttest_ind(np.arange(1.,26), np.arange(10,31.)-0.5)
(-3.2015129142545442, 0.0025398589272691129)       reject Null
>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5)
(0.35999999999999999, 0.078993426961291274)        don't reject Null at 0.05
)

---------------------------
# -*- coding: utf-8 -*-
"""Comparing pvalues for ks_2samp in small sample, permutation test

Created on Thu Nov 01 22:42:30 2012
Author: Josef Perktold

Results:

sample: 1
permutation 0.07704
scipy       0.0789934269613
R exact     0.07608
R asymp     0.1038

subset results
[ 0.0774  0.0783  0.0732  0.0772  0.0775  0.0755  0.0791  0.0743  0.0816
  0.0763]
>>> execfile('permutations_ks1samp.py')

sample: 0
permutation 0.00791
scipy       0.00378135405937
R exact     0.007937
R asymp     0.01348

subset results
[ 0.008   0.0094  0.0074  0.0071  0.0071  0.0084  0.0073  0.0084  0.0083
  0.0077]

"""

import numpy as np
from scipy import stats

sample = 0

if sample == 0:
    x1, y1 = np.arange(1.,6), np.arange(11,16.)
else:
    x1, y1 = np.arange(1.,26), np.arange(10,31.)-0.5
nx = len(x1)

D, pval = stats.ks_2samp(x1, y1)

#permutation p-values
n_rep = 100000  #make it large
results = np.empty((n_rep, 2))
results.fill(np.nan)

x, y = x1.copy(), y1.copy()
xy = np.concatenate((x,y))
np.random.seed(2345678)
for ii in xrange(n_rep):
    np.random.shuffle(xy)
    results[ii] = stats.ks_2samp(xy[:nx], xy[nx:])

print '\nsample:', sample
print 'permutation', (results[:,0] >= D).mean()
print 'scipy      ', pval
if sample == 0:
    print 'R exact    ', 0.007937
    print 'R asymp    ', 0.01348
else:
    print 'R exact    ', 0.07608
    print 'R asymp    ', 0.1038

print '\nsubset results'
print (results[:,0].reshape(10000, -1) >= D).mean(0)
-------------------------

Josef


>>>
>>> Josef
>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Peng
>>>> _______________________________________________
>>>> SciPy-User mailing list
>>>> SciPy-User@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-user


More information about the SciPy-User mailing list