[SciPy-User] "small data" statistics
Sturla Molden
sturla@molden...
Fri Oct 12 10:01:23 CDT 2012
On 12.10.2012 16:21, Emanuele Olivetti wrote:
> 1) In this thread people expressed interest in making hypothesis testing
> from small samples, so is permutation test addressing the question of
> the accompanying motivating example? In my opinion it is not and I hope I
> provided brief but compelling motivation to support this point of view.
For the problem Josef described, I'd analyze that as a two-sample
goodness-of-fit test against a common bin(20,p) distribution.
> 2) What are the assumptions under which the permutation test is
> valid/acceptable (independently from the accompanying motivating example)?
> I have looked around on this topic but I had just found generic desiderata for
> all resampling approaches, i.e. that the sample should be "representative"
> of the underlying distribution - whatever this means in practical terms.
Ronald A. Fisher considered the permutation test to be the "exact
procedure" the t-test should approximate. It has, in fact, all the
assumptions of the t-test.
Surprisingly many people think the t-test assumes normally distributed
data. It does not. If you have this idea too, please forget it.
The t-test only asserts that the large-sample "sampling distribution of
the mean" (i.e. of the mean you calculate, not of the data points
themselves) is a normal distribution. This is due to the central limit
theorem. If you collect enough data, the distribution of the sample mean
will converge towards a normal distribution. That is a mathematical
necessity, and can be proven to be the case for any distribution with
finite variance. But with small data samples, the sampling distribution
of the mean can deviate from a normal distribution. That is when we need
to use the permutation test instead.
I.e.: The t-test is an approximation to the permutation test for "large
enough" data samples.
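To make the comparison concrete, here is a minimal sketch of a two-sided
Monte Carlo permutation test for a difference in means, run next to the
ordinary equal-variance t-test. The helper name
permutation_test_mean_diff, the number of permutations, and the two
small skewed samples are all made up for illustration; it assumes a
NumPy recent enough to provide np.random.default_rng.

```python
import numpy as np
from scipy import stats


def permutation_test_mean_diff(x, y, n_perm=10000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means.

    Hypothetical helper written for this example, not a SciPy function.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    observed = abs(x.mean() - y.mean())
    count = 0
    for _ in range(n_perm):
        # Randomly reassign the group labels by shuffling the pooled data.
        perm = rng.permutation(pooled)
        diff = abs(perm[:x.size].mean() - perm[x.size:].mean())
        # Small tolerance so floating-point ties count as "at least as extreme".
        if diff >= observed - 1e-12:
            count += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (count + 1) / (n_perm + 1)


# Made-up small, skewed samples (exponential, second one shifted by 1).
rng = np.random.default_rng(42)
x = rng.exponential(scale=1.0, size=8)
y = rng.exponential(scale=1.0, size=8) + 1.0

p_perm = permutation_test_mean_diff(x, y)
t_stat, p_t = stats.ttest_ind(x, y)  # classic pooled-variance t-test

print("permutation test p-value:", p_perm)
print("t-test p-value:          ", p_t)
```

With samples this small the two p-values need not agree closely; as the
sample sizes grow they should move towards each other, which is the
sense in which the t-test approximates the permutation test.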
What we mean by "large enough" is another story. We can e.g. estimate
the sampling distribution of the mean using Efron's bootstrap, and run a
goodness-of-fit test. What most practitioners do, though, is to check
whether their data are approximately normally distributed. That usually
signifies a lack of understanding of the t-test. They think the data
must be normal. They need not be. But if the data are normally
distributed, we can be sure the sample mean is normal as well.
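The bootstrap check mentioned above could be sketched roughly as
follows. This is only an illustration under assumptions: the helper
bootstrap_means, the made-up exponential sample, the number of
resamples, and the choice of the Shapiro-Wilk test as the
goodness-of-fit test are mine, not something prescribed in this thread.

```python
import numpy as np
from scipy import stats


def bootstrap_means(data, n_boot=2000, seed=0):
    """Resample `data` with replacement and return the mean of each resample.

    Hypothetical helper written for this example.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Each row of idx holds the indices of one bootstrap resample.
    idx = rng.integers(0, data.size, size=(n_boot, data.size))
    return data[idx].mean(axis=1)


# Made-up small, skewed sample.
rng = np.random.default_rng(1)
sample = rng.exponential(scale=1.0, size=10)

means = bootstrap_means(sample)

# Goodness-of-fit of the bootstrap means against a normal distribution.
w_stat, p_value = stats.shapiro(means)
skewness = stats.skew(means)

print("Shapiro-Wilk W:", w_stat, "p-value:", p_value)
print("skewness of bootstrap means:", skewness)
```

Note that the p-value depends on the number of bootstrap resamples: with
enough of them, even a tiny departure from normality will be flagged, so
the skewness (or a Q-Q plot) is probably the more useful thing to look
at.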
So under what circumstances are the assumptions for the permutation test
not satisfied?
One notable example is the Behrens-Fisher problem! That is, you want to
compare the expected values of two distributions with different
variances. Under the null hypothesis the permutation test treats the
group labels as exchangeable, and that fails when the variances differ,
even if the means are equal. The permutation test does not help to solve
this problem any more than the t-test does. This is clearly a situation
where distributions matter, showing that the permutation test is not a
"distribution free" test.
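The Behrens-Fisher point can be checked with a small simulation. The
sketch below draws two normal samples with equal means, runs a Monte
Carlo permutation test on the difference in means, and counts how often
it rejects at the 5% level. The function names, sample sizes, standard
deviations, and number of simulations are all made-up choices for
illustration.

```python
import numpy as np


def perm_pvalue(x, y, n_perm, rng):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    pooled = np.concatenate([x, y])
    n_x = x.size
    observed = abs(x.mean() - y.mean())
    # Each row of idx is an independent random relabelling of the pooled data.
    idx = np.argsort(rng.random((n_perm, pooled.size)), axis=1)
    perms = pooled[idx]
    diffs = np.abs(perms[:, :n_x].mean(axis=1) - perms[:, n_x:].mean(axis=1))
    return (np.sum(diffs >= observed - 1e-12) + 1) / (n_perm + 1)


def rejection_rate(n_x, sd_x, n_y, sd_y, n_sim=500, n_perm=199,
                   alpha=0.05, seed=0):
    """Fraction of simulated data sets, with equal means, that are rejected."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        x = rng.normal(0.0, sd_x, size=n_x)
        y = rng.normal(0.0, sd_y, size=n_y)
        if perm_pvalue(x, y, n_perm, rng) <= alpha:
            rejections += 1
    return rejections / n_sim


# Same means in both cases; only the variances and sample sizes differ.
rate_equal = rejection_rate(n_x=5, sd_x=1.0, n_y=20, sd_y=1.0)
rate_unequal = rejection_rate(n_x=5, sd_x=5.0, n_y=20, sd_y=1.0)

print("rejection rate, equal variances:  ", rate_equal)
print("rejection rate, unequal variances:", rate_unequal)
```

With equal variances the rejection rate should stay close to the
nominal 5%. When the smaller sample has the larger variance, it should
come out well above that, even though the means are identical.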
Sturla