[SciPy-Dev] Resolving PR 235: t-statistic = 0/0 case

josef.pktd@gmai... josef.pktd@gmai...
Wed Jun 6 19:51:21 CDT 2012


On Wed, Jun 6, 2012 at 5:43 PM, Junkshops <junkshops@gmail.com> wrote:
> Hi Skipper,
>
>> Practically speaking, it's a bit of a stretch to assume that the data
>> generating process for [0,0,0] is (even approximately) normal, so I
>> think it is appropriate for the test to do some sanity checking.
>>
>> The t-test itself is only valid given that the underlying data
>> satisfies the assumptions, and I don't think a constant random
>> variable meets the requirements.
> This is pretty much identical to Nathaniel's objection, so perhaps you
> wouldn't mind responding to my argument there so we don't end up with
> separate threads on the same topic? I'll try to respond to multiple
> emails that cover similar ground at once from now on.
>
>> Until I see any math or a reference, I think returning NaN is the path
>> of least resistance.
> I have a nasty feeling this is going to look obnoxious, but I think the
> math is:
>
> Pr(MD = 0|MD = 0) = 1 where MD is the mean difference.

Not at all. If this is a statement about the posterior distribution,
the first MD is our hypothesis about the true parameter, and the second
MD is the MD of a single pair of samples.
If you just have one sample (of 2 series), or draw from two
populations, you cannot have a p-value of 1 (with nonzero probability)
unless you sample the entire populations (or you have a dogmatic
prior, or your model assumptions are wrong).  (I hope this has enough
qualifiers and brackets to be correct. :)


>
> That does look extremely obnoxious. Sorry. But that's the case here, to
> the best of my ability to tell. Basically, if the means are equal, it
> doesn't matter what the distribution assumption is - because the means
> are equal with 100% probability.
>
> But again, if everyone wants NaN I'll capitulate.

I'm pretty much coming to the NaN position too; at least it will make
future discussions about it less likely.

My earlier intention was to get a "reasonable" non-NaN value, and a
t-statistic of 1 is what you get if only one value is slightly different.
What I didn't realize 3 years ago is that finding a limit is futile, as
Nathaniel mentioned above: for every variance greater than zero, the
p-values have to be uniformly distributed if the test distribution
is correct. So we cannot find a degenerate limit. (And I got my grey
hairs for nothing in return.)
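
A rough illustration of why no limit exists, as a sketch and not code
from the PR. It assumes a one-sample test of mean == 0 on three
observations, so the statistic has 2 degrees of freedom and the Student
t CDF has a closed form. The names ttest_3obs and rejection_rate are
made up for this example. Perturbing a single value gives t = 1 for
every size of perturbation, while normal data with any positive sigma
gives roughly the nominal rejection rate, so the answer near zero
variance depends on how you approach it.

```python
import math
import random
import statistics


def ttest_3obs(x):
    """One-sample t-test of H0: mean == 0 for exactly three observations.

    With n = 3 the statistic has 2 degrees of freedom, and the Student t
    CDF has the closed form F(t) = 1/2 + t / (2 * sqrt(t**2 + 2)).
    """
    if len(x) != 3:
        raise ValueError("this sketch only handles three observations")
    se = statistics.stdev(x) / math.sqrt(3)
    t = statistics.mean(x) / se  # raises ZeroDivisionError for constant data
    p = 1.0 - abs(t) / math.sqrt(t * t + 2.0)  # two-sided p-value
    return t, p


def rejection_rate(sigma, n_sim=4000, alpha=0.05, seed=0):
    """Fraction of simulated N(0, sigma) samples with p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        x = [rng.gauss(0.0, sigma) for _ in range(3)]
        _, p = ttest_3obs(x)
        hits += p < alpha
    return hits / n_sim


# Path 1: only one value differs from zero; t is 1 whatever eps is.
perturbed = {eps: ttest_3obs([eps, 0.0, 0.0])[0] for eps in (1.0, 1e-3, 1e-9)}

# Path 2: normal data with shrinking sigma; the rejection rate stays
# near alpha instead of converging to some degenerate value.
rates = {sigma: rejection_rate(sigma) for sigma in (1.0, 1e-3, 1e-6)}

print(perturbed)
print(rates)
```

The t = 1 result holds for any sample size, since the mean and the
standard error of [eps, 0, ..., 0] are both eps/n. With three
observations that corresponds to a two-sided p-value of about 0.42.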

(I don't believe that users who might have a 0/0 problem will have
data with truly zero variance; my bet would be on discretized or
correlated data without variation, and the t-test is shaky in that case.
I played with some discretized examples but didn't get anywhere clean
either.)

To stop guessing and return NaN sounds very good,
or alternatively, for practical purposes, let the user choose:
zoz=np.nan (default), with zoz=0 or zoz=1 also allowed.
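
A minimal sketch of how such an option might look, assuming a paired
t-statistic computed by hand. The function name paired_tstat is
hypothetical and this is not SciPy's API; only the zoz keyword and its
allowed values come from the suggestion above. It uses math.nan from
the standard library in place of np.nan. Returning a signed infinity
for nonzero/0 is an extra assumption, not something settled in this
thread.

```python
import math
import statistics


def paired_tstat(a, b, zoz=math.nan):
    """Paired t-statistic; `zoz` is the value returned for the 0/0 case.

    Hypothetical helper for illustration only.
    """
    if not (math.isnan(zoz) or zoz in (0, 1)):
        raise ValueError("zoz must be nan, 0 or 1")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if se == 0:
        if mean_diff == 0:
            return zoz  # 0/0: the caller decides, NaN by default
        # Assumption: nonzero/0 is reported as a signed infinity.
        return math.copysign(math.inf, mean_diff)
    return mean_diff / se


print(paired_tstat([0, 0, 0], [0, 0, 0]))         # default: nan
print(paired_tstat([0, 0, 0], [0, 0, 0], zoz=1))  # user-chosen value
```

Keeping NaN as the default means nobody gets a guessed statistic
without asking for one explicitly.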

Josef

>
> -g
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev

