[SciPy-User] Kuiper test (was Re: scipy.stats.fit inquiry)

Anne Archibald peridot.faceted@gmail....
Wed Oct 21 17:15:21 CDT 2009


2009/10/21  <josef.pktd@gmail.com>:
> When the parameters of the t distribution
> are estimated, the test has no power.
> I don't know if adjusting the critical values
> would help, at least for comparing similar
> distributions like norm and t.
> (same problem with kstest)

Yes, this test has its limitations. It's also not as sensitive as one
might wish for distinguishing multimodal distributions from a
constant distribution.

> To avoid the lambda function, the cdf could take args supplied as
> an argument to kuiper. Maybe the specification of the cdf argument
> could be taken from kstest, but not (the generality of) the rvs argument
> in kstest.

I loathe interfaces that pass args; I find they make the interface
more confusing while adding no functionality. But I realize that
they're standard and not going away (though doesn't the currying
support in newer Pythons, functools.partial, make them unnecessary?),
so I'll add it.
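
For what it's worth, the lambda can already be avoided without args: a
frozen distribution gives a one-argument cdf, and functools.partial
(Python 2.5+) curries keyword arguments. A quick sketch (the
kuiper(data, cdf) signature here is just assumed for illustration):

import functools

from scipy import stats

data = stats.t.rvs(4, size=100)

# frozen distribution: cdf(x) with the shape parameter already fixed
cdf_frozen = stats.t(4).cdf

def t_cdf(x, df):
    # thin wrapper, just to show the currying explicitly
    return stats.t.cdf(x, df)

cdf_curried = functools.partial(t_cdf, df=4)

# either callable could then be passed as, e.g., kuiper(data, cdf_frozen)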

> Your cdf_from_intervals looks nice for a univariate step-function
> distribution. There are several pieces for this on the mailing list,
> but I never got around to collecting them.

This was just a quick hack because I needed it; I'm not sure it's
worth including in scipy.stats, since an appropriately general version
might be just as difficult to call as to reimplement.
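
To give the flavour, the idea might look like this (a rough sketch,
not the code I actually have; intervals are assumed sorted,
non-overlapping, and covering [0,1)):

import numpy as np

def cdf_from_intervals(starts, ends, weights):
    starts = np.asarray(starts, dtype=float)
    ends = np.asarray(ends, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # probability mass of each interval, normalized to a distribution
    masses = weights * (ends - starts)
    masses = masses / masses.sum()
    cum = np.concatenate(([0.0], np.cumsum(masses)))

    def cdf(x):
        # piecewise-linear interpolation within the containing interval
        x = np.asarray(x, dtype=float)
        i = np.clip(np.searchsorted(ends, x), 0, len(starts) - 1)
        frac = np.clip((x - starts[i]) / (ends[i] - starts[i]), 0.0, 1.0)
        return cum[i] + masses[i] * frac

    return cdf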

> Your interval functions, I haven't quite figured out. Is fold_intervals
> supposed to convert a set of overlapping intervals to a non-overlapping
> partitioning with associated weights? For example for merging
> 2 histograms with non-identical bounds?

That, among other things. The context was that I had observations of an
X-ray binary, consisting of several different segments with several
different instruments. So I'd have one observation covering orbital
phases 0.8 to 1.3 (i.e. 0.5 turns) with effective area 10 cm^2 (say),
and another covering phases 0.2 to 2.1 (i.e. 1.9 turns) with effective
area 12 cm^2 (say). So a constant flux from the source would produce a
non-constant distribution of photons modulo 1; fold_intervals is
designed to convert those two weighted spans to a collection of
non-overlapping weighted intervals covering [0,1), for use in a CDF
for the Kuiper test.  The histogram stuff is there to allow plotting
the results in a familiar form.
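
In code, the folding might look roughly like this (a sketch of the
idea rather than the exact implementation; spans are (start, end,
weight) tuples with phases measured in turns):

import numpy as np

def fold_intervals(intervals):
    breaks = set([0.0, 1.0])
    base = 0.0      # weight contributed uniformly by complete turns
    extras = []     # one extra pass over the fractional part of each span
    for start, end, w in intervals:
        base += w * np.floor(end - start)
        lo, hi = start % 1.0, end % 1.0
        if lo < hi:
            extras.append((lo, hi, w))
        elif lo > hi:               # the fractional part wraps past phase 1
            extras.append((lo, 1.0, w))
            extras.append((0.0, hi, w))
        breaks.update([lo, hi])
    breaks = np.array(sorted(breaks))
    weights = np.zeros(len(breaks) - 1) + base
    mids = 0.5 * (breaks[:-1] + breaks[1:])   # classify pieces by midpoint
    for lo, hi, w in extras:
        weights[(mids > lo) & (mids < hi)] += w
    return breaks, weights

# the two observations from the example above:
breaks, weights = fold_intervals([(0.8, 1.3, 10.0), (0.2, 2.1, 12.0)])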

> I haven't figured out the shift invariance.

The idea is just that if you have, say, a collection X of samples from
[0,1) that you are testing for uniformity, the Kuiper test returns the
same result for X and for (X+0.3)%1. Since for pulsars there is often
no natural start time, this is a sensible thing to ask of any test
for uniformity.
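
The statistic itself makes the invariance easy to check numerically
(a minimal sketch, computing only the statistic, not the p-value):

import numpy as np

def kuiper_statistic(x):
    # Kuiper's V = D+ + D- for testing uniformity of x in [0, 1)
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1.0, n + 1)
    d_plus = np.amax(i / n - x)          # ECDF excursion above the line
    d_minus = np.amax(x - (i - 1) / n)   # ECDF excursion below the line
    return d_plus + d_minus

x = np.random.uniform(size=1000)
v1 = kuiper_statistic(x)
v2 = kuiper_statistic((x + 0.3) % 1.0)
# v1 == v2 to floating-point precision; the KS statistic D would differ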

>> I also have some code for the H test (essentially looking at the
>> Fourier coefficients to find how many harmonics are worth including
>> and what the significance is; de Jager et al. 1989 "A powerful test for
>> weak periodic signals with unknown light curve shape in sparse data").
>> But circular statistics seem to be a bit obscure, so I'm not sure how
>> much effort should go into putting this in scipy.
>
> For sure they are obscure to me, but there are a few circular descriptive
> statistics in stats.morestats, and I saw a Matlab toolbox on the
> file exchange. I have figured out by now that some pretty different
> statistics are used in various fields.
> I guess it's all up to you.

To be honest, these are a little obscure even among pulsar
astronomers; here as elsewhere histograms seem to dominate.
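
For reference, the statistic itself is only a few lines; here is a
sketch following the description in de Jager et al. 1989 (phases in
turns, i.e. in [0,1); the significance approximation from the paper
is omitted):

import numpy as np

def h_statistic(phases, max_harmonics=20):
    phases = np.asarray(phases, dtype=float)
    n = len(phases)
    k = np.arange(1, max_harmonics + 1)[:, np.newaxis]
    angles = 2 * np.pi * k * phases[np.newaxis, :]
    # Z^2_m is the accumulated Fourier power of the first m harmonics
    z2 = (2.0 / n) * np.cumsum(np.cos(angles).sum(axis=1) ** 2 +
                               np.sin(angles).sum(axis=1) ** 2)
    m = np.arange(1, max_harmonics + 1)
    h = z2 - 4 * m + 4
    best = np.argmax(h)
    return h[best], best + 1   # H and the number of harmonics it used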

> From your description (below), I would think that for circular
> distributions we would need different generic functions that
> don't fit in the current distribution classes: integration on a
> circle (?) instead of integration on the real line.

Basically, yes. Of course you can view integration on a circle as
integration on a line parameterizing the circle, but there's no way
around the fact that the first moment is a complex number which
contains both position (argument) and concentration (magnitude)
information.
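
For instance (a small illustration; the closed form for the magnitude
of the von Mises moments, I_n(kappa)/I_0(kappa), is standard):

import numpy as np
from scipy import special, stats

kappa = 2.0
theta = stats.vonmises.rvs(kappa, size=100000)   # angles in (-pi, pi]

m1 = np.exp(1j * theta).mean()    # first circular moment
mean_direction = np.angle(m1)     # position
resultant_length = np.abs(m1)     # concentration, in [0, 1]

# theoretical magnitude for the von Mises first moment:
expected = special.iv(1, kappa) / special.iv(0, kappa)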

> If I define boundaries, vonmises.a and vonmises.b, then
> I think, you would not be able to calculate pdf(x), cdf(x)
> and so on for x outside of the support [a,b]. I don't know
> whether it is possible to define a,b but not enforce them in
> _argcheck and only use them as integration bounds.

It's useful to be able to work with the pdf and cdf outside the bounds,
particularly since the effect of loc is to shift the bounds; any
code that takes loc as a free parameter would otherwise have to work
hard to avoid going outside them.

> I checked briefly: vonmises.moment and vonmises.stats
> work (integrating over the ppf, not the pdf);
> the generic moment calculation with the pdf (_mom0_sc) and
> entropy fail. fit seems to work with the normal random
> variable, but I thought I got lots of "weird" fit results
> before.
>
> For the real line, this looks "strange"
>>>> stats.vonmises.pdf(np.linspace(0,3*np.pi,10), 2)
> array([ 0.51588541,  0.18978364,  0.02568442,  0.00944877,  0.02568442,
>        0.18978364,  0.51588541,  0.18978364,  0.02568442,  0.00944877])
>>>> stats.vonmises.cdf(np.linspace(0,3*np.pi,10), 2)
> array([  0.5       ,   0.89890776,   0.98532805,   1.        ,
>         7.28318531,   7.28318531,   7.28318531,   7.28318531,
>         7.28318531,  13.56637061])

Unfortunately it seems that my attempt to fix the von Mises
distribution (using vonmises_cython) went badly awry, and now it
reports nonsense for values outside -pi..pi. I could have sworn I had
tests for that. I intend to fix this during the weekend's code sprint.
"Correct" behaviour (IMHO) would make the CDF the antiderivative of
the PDF, even though this means it leaves [0,1].
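
Concretely, I mean something like this (a sketch, assuming the wrapped
cdf is correct on the base interval [-pi, pi]):

import numpy as np
from scipy import stats

def vonmises_cdf_unwrapped(x, kappa):
    # antiderivative of the pdf: gains exactly 1 per full turn,
    # so it leaves [0, 1] outside [-pi, pi)
    x = np.asarray(x, dtype=float)
    turns = np.floor((x + np.pi) / (2 * np.pi))
    return turns + stats.vonmises.cdf(x - 2 * np.pi * turns, kappa)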

Incidentally, how would I get nosetests to run only some of the
distributions tests (ideally just the ones I choose)? When I just do
"nosetests scipy.stats" it takes *forever*.

> ppf, calculated generically, works and looks only a little bit strange.
>
>>>> stats.vonmises.ppf([0, 1e-10,1-1e-10,1],2)
> array([       -Inf, -3.14159264,  3.14159264,         Inf])
>
>>
>> It's worth defining the boundaries, but I don't think you'll get
>> useful moment calculations out of it, since circular moments are
>> defined differently from linear moments: rather than int(x**n*pdf(x))
>> they're int(exp(2j*pi*n*x)*pdf(x)). There are explicit formulas for
>> these for the von Mises distribution, though, and they're useful
>> because they are essentially the Fourier coefficients of the pdf.
>>
>> For context, I've done some work with representing circular
>> probability distributions (= pulse profiles) in terms of their moments
>> (= Fourier series), and in particular using a kernel density estimator
>> with either a sinc kernel (= truncation) or a von Mises kernel (=
>> scaling coefficients by von Mises moments). The purpose was to get
>> not-too-biased estimates of the modulation amplitudes of X-ray
>> pulsations; the paper has been kind of sidetracked by other work.
>
> Sounds like statistics for seasonal time series to me, except you
> might have a lot more regularity in the repeated pattern than in
> economics or climate research.

Well, yes and no. Many pulsars have a stable average pulse profile,
which is what we usually care about, especially where we're in the
regime of one photon every few tens of thousands of turns. On the
other hand, when dealing with orbital variability, I ran into a
statistical question I wasn't sure how to pose: if I add all the
(partial) orbits of data I have together, I get a distribution that is
pretty clearly nonconstant. But is that variability actually repeating
from orbit to orbit, or is it really just random variability? In a
seasonal context it sounds less exotic: if I combine data from the
last three years, I find a statistically significant excess of rain in
the fall. But is this just the effect of one very rainy season that
happened to be in the fall one year, or is there an excess each fall?
The individual years, unfortunately, don't seem to have
enough events to detect significant non-uniformity.

Anne

