[SciPy-User] "small data" statistics
Fri Oct 12 03:36:12 CDT 2012
On 10/11/2012 04:57 PM, email@example.com wrote:
> Most statistical tests and statistical inference in scipy.stats and
> statsmodels relies on large number assumptions.
> Everyone is talking about "Big data", but is anyone still interested
> in doing small sample statistics in python.
> I'd like to know whether it's worth spending any time on general
> purpose small sample statistics.
> for example:
> Example homework problem:
> Shallow Processing: 13 12 11 9 11 13 14 14 14 15
> Deep Processing: 12 15 14 14 13 12 15 14 16 17
I am very interested in inference from small samples, but I have
some concerns about both the example and the proposed approach
based on the permutation test.
IMHO the question in the example at that URL, i.e. "Did the instructions
given to the participants significantly affect their level of recall?" is
not directly addressed by the permutation test. The permutation test is
related to the question "how (un)likely is the collected dataset under the
assumption that the instructions did not affect the level of recall?".
In other words, the initial question is about quantifying how likely the
hypothesis "the instructions do not affect the level of recall"
(let's call it H_0) is given the collected dataset, compared with how likely the
hypothesis "the instructions affect the level of recall" (let's call it H_1)
is given the data. In slightly more formal notation, the initial question is about
estimating p(H_0|data) and p(H_1|data), while the permutation test provides
a different quantity, which is related (see the note at the end of this
message) to p(data|H_0). Clearly p(data|H_0) is different from p(H_0|data).
For literature on this point see, for example, http://dx.doi.org/10.1016/j.socec.2004.09.033
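
To make the difference between the two quantities explicit, Bayes' rule can be
written out for this case. This is only a sketch: it assumes that H_0 and H_1
are the only two hypotheses under consideration, and the prior probabilities
p(H_0) and p(H_1) are extra ingredients that do not appear in the example.

```latex
\[
p(H_0 \mid \text{data}) =
  \frac{p(\text{data} \mid H_0)\, p(H_0)}
       {p(\text{data} \mid H_0)\, p(H_0) + p(\text{data} \mid H_1)\, p(H_1)}
\]
```

So going from p(data|H_0) to p(H_0|data) requires both the priors and
p(data|H_1), none of which a test computed under H_0 alone provides.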
On a different note, I am also interested in understanding the assumptions
under which the permutation test is expected to work. I am not an expert in that
field but, as far as I know, the permutation test, like resampling approaches
in general, requires that the sample is "representative" of the underlying
distribution of the problem. In my opinion this requirement is difficult to assess
in practice, and it is even more troubling for the specific case of "small data",
which is the topic of this thread.
Any comment on these points is warmly welcome.
A minor detail on the word "related" above: the outcome of the permutation test,
and of classical tests for hypothesis testing in general, is not precisely p(data|H_0).
First of all, those tests rely on a statistic of the dataset and not on the dataset itself.
In the example at the URL the statistic (called "criterion" there) is the difference
between the means of the two groups. Second, and more importantly,
the test provides an estimate of the probability of observing such a value
for the statistic... "or a more extreme one". So if we call the statistic computed
on the data T(data), then the classical tests provide p(t>=T(data)|H_0), and not
p(data|H_0). In any case, even p(t>=T(data)|H_0) is clearly different from the initial
question, i.e. p(H_0|data).
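
As a concrete illustration of the tail probability described above, here is a
minimal sketch in Python (standard library only) of an exact permutation test
on the two groups quoted at the top of this message. The function names are
hypothetical, the statistic is the difference between the group means (deep
minus shallow), and the tail is one-sided with ties counted as "at least as
extreme". These are all assumptions of the sketch, not something fixed by the
example.

```python
from itertools import combinations

# Data quoted from the example homework problem.
shallow = [13, 12, 11, 9, 11, 13, 14, 14, 14, 15]
deep = [12, 15, 14, 14, 13, 12, 15, 14, 16, 17]


def mean_difference(group_a, group_b):
    """Statistic T: mean of group_b minus mean of group_a."""
    return sum(group_b) / len(group_b) - sum(group_a) / len(group_a)


def exact_permutation_pvalue(group_a, group_b):
    """Return (T(data), fraction of relabelings with t >= T(data)).

    Under H_0 the group labels are assumed exchangeable, so every way of
    choosing which pooled values are called "group_a" is equally likely.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = mean_difference(group_a, group_b)

    n_extreme = 0
    n_splits = 0
    # combinations() treats positions as distinct, so repeated values are fine.
    for relabeled_a in combinations(pooled, n_a):
        sum_a = sum(relabeled_a)
        t = (total - sum_a) / n_b - sum_a / n_a
        # Small tolerance so that ties are not lost to floating-point rounding.
        if t >= observed - 1e-12:
            n_extreme += 1
        n_splits += 1
    return observed, n_extreme / n_splits


# 20 choose 10 = 184756 relabelings, small enough to enumerate exactly.
observed, p_value = exact_permutation_pvalue(shallow, deep)
print("T(data) =", observed)
print("fraction of relabelings with t >= T(data) =", p_value)
```

Whatever number this prints is p(t>=T(data)|H_0) for this particular statistic
only; by itself it says nothing about p(H_0|data).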