[SciPy-user] new Kolmogorov-Smirnov 2-sample test not Numerical Recipes based

josef.pktd@gmai... josef.pktd@gmai...
Wed Dec 3 20:53:09 CST 2008


On Wed, Dec 3, 2008 at 3:15 PM,  <josef.pktd@gmail.com> wrote:
> On Wed, Dec 3, 2008 at 2:49 PM, Jarrod Millman <millman@berkeley.edu> wrote:
>> On Wed, Dec 3, 2008 at 11:43 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
>>>> def ks_2samp(data1, data2):
>>>>    """ Computes the Kolmogorov-Smirnof statistic on 2 samples.  Modified
>>>>    from Numerical Recipies in C, page 493.  Returns KS D-value, prob.  Not
>>>>    ufunc- like.
>>>
>>> Wait - really?  We can't use Numerical Recipes code, it has strict and
>>> incompatible licensing...  If it's in there it really has to come out
>>> as fast as possible.
>>
>> http://www.nr.com/licenses/redistribute.html
>>

I did a 2sample kstest based on the definition using search sorted,
see function below.
Attached is script file with standalone old and new versions and Monte
Carlo evaluation.
I didn't do any speed comparison.
The function works only for one dimensional data, neither does the old
one, but there is some
initial more than 1 dimension setup in the old version that I don't understand.

Main reference used (after googling):
http://math.ucsd.edu/~gptesler/283/kolmogorov_smirnov_05.pdf

If there is any interest, I can replace the existing implementation
with the new function

Josef

Conclusion:
Comparing old implementation based on numerical recipes with new implementation

* in 2 samples have same size then difference in KS statistic is less than 1e-14
* if sample sizes differ
  * in n1=2, n2=3 example new version is correct, old version is wrong
  * Monte Carlo essentially identical results (sample sizes between 100 and 1000
    rejection rates for alpha = 1%,5%,10% sometimes differ in 3rd decimal

>>> data1 = np.array([1.0,2.0])
>>> data2 = np.array([1.0,2.0,3.0])
>>> ks_2samp_new(data1+0.01,data2)
(0.33333333333333337, 0.99062316386915694)
>>> ks_2samp(data1+0.01,data2)
(0.33333333333333331, 0.99062316386915694)
>>> ks_2samp_new(data1-0.01,data2)
(0.66666666666666674, 0.42490954988801982)
>>> ks_2samp(data1-0.01,data2)  # result not correct
(-0.5, 0.77962787254643151)


===============
def ks_2samp_new(data1, data2):
    """ Computes the Kolmogorov-Smirnof statistic on 2 samples.

    Returns: KS D-value, p-value
    """
    data1, data2 = map(asarray, (data1, data2))
    n1 = data1.shape[0]
    n2 = data2.shape[0]
    n1 = len(data1)
    n2 = len(data2)
    data1 = np.sort(data1)
    data2 = np.sort(data2)
    data_all = np.concatenate([data1,data2])
    cdf1 = np.searchsorted(data1,data_all,side='right')/(1.0*n1)
    cdf2 = (np.searchsorted(data2,data_all,side='right'))/(1.0*n2)
    d = np.max(np.absolute(cdf1-cdf2))
    #Note: d absolute not signed distance
    en = np.sqrt(n1*n2/float(n1+n2))
    try:
        prob = ksprob((en+0.12+0.11/en)*d)
    except:
        prob = 1.0
    return d, prob
====================
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ks2samp_rewrite_cleaned.py
Url: http://projects.scipy.org/pipermail/scipy-user/attachments/20081203/abce4678/attachment-0001.pl 


More information about the SciPy-user mailing list