[SciPy-User] "small data" statistics

Nathaniel Smith njs@pobox....
Sat Oct 13 04:43:54 CDT 2012

On Fri, Oct 12, 2012 at 4:27 PM, Emanuele Olivetti
<emanuele@relativita.com> wrote:
> On 10/12/2012 01:22 PM, Nathaniel Smith wrote:
>> On 12 Oct 2012 09:37, "Emanuele Olivetti" <emanuele@relativita.com> wrote:
>>> IMHO the question in the example at that URL, i.e. "Did the instructions
>>> given to the participants significantly affect their level of recall?" is
>>> not directly addressed by the permutation test.
>> In this sentence, the word "significantly" is a term of art used to refer
>> exactly to the quantity p(t>T(data)|H_0). So, yes, the permutation test
>> addresses the original question; you just have to be familiar with the
>> field's particular jargon to understand what they're saying. :-)
> Thanks Nathaniel for pointing that out. I guess I'll hardly be much familiar
> with such a jargon ;-). Nevertheless while reading the example I believed
> that the aim of the thought experiment was to decide among two competing
> theories/hypothesis, given the results of the experiment.

Well, it is, at some level. But in practice psychologists are not
simple Bayesian updaters, and in the context of their field's
practices, the way you make these decisions involves Neyman-Pearson
significance tests as one component. Of course one can debate whether
that is a good thing or not (I actually tend to fall on the side that
says it *is* a good thing), but that's getting pretty far afield of
Josef's question :-).

> But I share your point that the term "significant" turns it into a different
> question.
>> All tests require some kind of representativeness, and this isn't really a
>> problem. The data are by definition representative (in the technical sense)
>> of the distribution they were drawn from. (The trouble comes when you want
>> to decide whether that distribution matches anything you care about, but
>> looking at the data won't tell you that.) A well designed test is one that
>> is correct on average across samples.
> Indeed my wording was imprecise so thanks once more for correcting
> it. Moreover you put it really well: "The trouble comes when you want to
> decide whether that distribution matches anything you care about, but
> looking at the data won't tell you that".
> Could you tell more about evaluating the correctness of a test across
> different samples? It sounds interesting.

Well, it's a relatively simple point, actually. A good frequentist
significance test is, by definition, a function f(data) that returns
a p-value satisfying two rules:
1) When 'data' is sampled from the null hypothesis distribution, then
f(data) is uniformly distributed between 0 and 1.
2) When 'data' is sampled from an alternative distribution of
interest, then f(data) will have a distribution that is peaked near 0.

So the point is just that you can't tell whether a given function
f(data) is well-behaved or not by looking at a single value for
'data', since the requirements for being well-behaved talk only about
the distribution of f(data) given a distribution for 'data'.
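
To make the two rules concrete, here is a minimal simulation sketch. It
assumes a two-sample t-test (scipy.stats.ttest_ind) stands in for
f(data); the helper name simulate_pvalues, the group size, and the
effect size of 1.0 are illustrative choices of mine, not anything from
the example under discussion. The idea is to draw many datasets from a
known distribution and look at the resulting collection of p-values,
which is exactly what a single dataset cannot show you.

```python
import numpy as np
from scipy import stats


def simulate_pvalues(effect, n_sims=2000, n_per_group=20, seed=0):
    """Return one t-test p-value per simulated two-group dataset.

    `effect` is the true difference in means between the groups;
    effect=0 means the data are sampled from the null hypothesis.
    (Hypothetical helper, for illustration only.)
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    b = rng.normal(effect, 1.0, size=(n_sims, n_per_group))
    # One test per row, i.e. one p-value per simulated dataset.
    return stats.ttest_ind(a, b, axis=1).pvalue


p_null = simulate_pvalues(effect=0.0)
p_alt = simulate_pvalues(effect=1.0, seed=1)

# Rule 1: under the null, about a fraction alpha of p-values fall below alpha.
frac_null = np.mean(p_null < 0.05)
# Rule 2: under the alternative, the p-values pile up near 0.
frac_alt = np.mean(p_alt < 0.05)
# Kolmogorov-Smirnov comparison of the null p-values with Uniform(0, 1).
ks = stats.kstest(p_null, "uniform")

print("fraction below 0.05 under H0:", frac_null)
print("fraction below 0.05 under H1:", frac_alt)
print("KS statistic vs. uniform:", ks.statistic)
```

If the test is well-behaved, the first fraction should sit near 0.05,
the second should be much larger, and the KS statistic should be small.
The same harness should work for any other f(data), a permutation test
included, by swapping out the ttest_ind call.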

