[SciPy-user] new Kolmogorov-Smirnov 2-sample test not Numerical Recipes based
Wed Dec 3 20:53:09 CST 2008
On Wed, Dec 3, 2008 at 3:15 PM, <firstname.lastname@example.org> wrote:
> On Wed, Dec 3, 2008 at 2:49 PM, Jarrod Millman <email@example.com> wrote:
>> On Wed, Dec 3, 2008 at 11:43 AM, Matthew Brett <firstname.lastname@example.org> wrote:
>>>> def ks_2samp(data1, data2):
>>>> """ Computes the Kolmogorov-Smirnof statistic on 2 samples. Modified
>>>> from Numerical Recipies in C, page 493. Returns KS D-value, prob. Not
>>>> ufunc- like.
>>> Wait - really? We can't use Numerical Recipes code, it has strict and
>>> incompatible licensing... If it's in there it really has to come out
>>> as fast as possible.
I did a 2sample kstest based on the definition using search sorted,
see function below.
Attached is script file with standalone old and new versions and Monte
I didn't do any speed comparison.
The function works only for one dimensional data, neither does the old
one, but there is some
initial more than 1 dimension setup in the old version that I don't understand.
Main reference used (after googling):
If there is any interest, I can replace the existing implementation
with the new function
Comparing old implementation based on numerical recipes with new implementation
* in 2 samples have same size then difference in KS statistic is less than 1e-14
* if sample sizes differ
* in n1=2, n2=3 example new version is correct, old version is wrong
* Monte Carlo essentially identical results (sample sizes between 100 and 1000
rejection rates for alpha = 1%,5%,10% sometimes differ in 3rd decimal
>>> data1 = np.array([1.0,2.0])
>>> data2 = np.array([1.0,2.0,3.0])
>>> ks_2samp(data1-0.01,data2) # result not correct
def ks_2samp_new(data1, data2):
""" Computes the Kolmogorov-Smirnof statistic on 2 samples.
Returns: KS D-value, p-value
data1, data2 = map(asarray, (data1, data2))
n1 = data1.shape
n2 = data2.shape
n1 = len(data1)
n2 = len(data2)
data1 = np.sort(data1)
data2 = np.sort(data2)
data_all = np.concatenate([data1,data2])
cdf1 = np.searchsorted(data1,data_all,side='right')/(1.0*n1)
cdf2 = (np.searchsorted(data2,data_all,side='right'))/(1.0*n2)
d = np.max(np.absolute(cdf1-cdf2))
#Note: d absolute not signed distance
en = np.sqrt(n1*n2/float(n1+n2))
prob = ksprob((en+0.12+0.11/en)*d)
prob = 1.0
return d, prob
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
More information about the SciPy-user