[SciPy-Dev] Resolving PR 235: t-statistic = 0/0 case
Wed Jun 6 15:18:11 CDT 2012
This mail references PR 235: https://github.com/scipy/scipy/pull/235
The PR adds a method for performing a t-test on 2 samples with unequal
or unknown variances, and changes the value returned when the
t-statistic = 0 / 0 from 1 (the previous code's return value) to 0 for
all t-tests.
The latter change is a point of contention for the pull. I take the
position that t = 0/0 should return 0, while Josef, the primary
scipy.stats maintainer, believes that t = 0/0 should return 1.
Normally I would try to resolve this directly with Josef, but
unfortunately I haven't heard from him in over two days, and the beta
release is scheduled for Saturday. As such, I'm writing the dev list to
ask what to do next.
Josef's position, to the best of my ability to understand it, is that:
J1) Adding a small amount of noise to data sets that have otherwise
equal means (e.g. [0,0,0,0], [0,0,0,1e-100]) results in a t-statistic of
1. Thus, as the mean difference approaches zero, t -> 1.
J2) A data set with no mean difference and no variance is a zero
probability event. As such, returning t = 1 is reasonable: it gives
p = 0.317-0.5 for a two-tailed test, depending on the degrees of freedom,
so for standard values of alpha the null hypothesis will not be
rejected, but the user gets some feedback that his data is suspect.
I admit I don't completely understand the second argument. Hopefully
when Josef resurfaces he can correct my representation of his argument.
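The p-value range quoted in J2 can be checked at its two extremes without
a statistics library: with 1 degree of freedom the t distribution is the
Cauchy distribution, and as the degrees of freedom grow it approaches the
standard normal. A minimal sketch using only the standard library (the
function names are illustrative, not part of scipy):

```python
import math

def two_tailed_p_df1(t):
    """Two-tailed p-value for Student's t with 1 degree of freedom (Cauchy)."""
    return 1.0 - (2.0 / math.pi) * math.atan(abs(t))

def two_tailed_p_normal(t):
    """Two-tailed p-value in the large-df limit (standard normal)."""
    return math.erfc(abs(t) / math.sqrt(2.0))

# t = 1: the two ends of the range quoted in J2
print(two_tailed_p_df1(1.0))     # 0.5, up to floating-point rounding
print(two_tailed_p_normal(1.0))  # about 0.3173

# t = 0: p = 1 at both extremes
print(two_tailed_p_df1(0.0), two_tailed_p_normal(0.0))
```

Intermediate degrees of freedom fall between these two values, which is
where the 0.317-0.5 range comes from.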
My responses to these arguments are:
J1) If you take the n-length vectors (x,...,x) and (x,...,x,x+y) and
solve for t, t = 1 for any value of x and any nonzero y. This is simply
due to the fact that the mean difference is y/n and the pooled variance
(i.e. the denominator of the t-statistic) is also y/n. Thus this is merely a
special case and does not represent a limit as the mean difference
approaches zero, since even for [0,0,0,0], [0,0,0,1e100] t = 1.
J2) Strictly speaking, if we're pulling independent random samples from
a normal distribution over the reals, say, any given set of samples has
zero probability since there are an infinite number of possible samples.
The sample [0,0,0] is just as probable as the sample [2.3, 7.4, 2.1]. In
a discrete distribution, in fact, the sample [0,0,0] is *more* likely
than any other sample. Also, we're now analyzing the legitimacy of the
user's data, which is not the job of the t-test. The t-test is expected
to take one or more data sets and provide a measure of the strength of
the evidence that the means are different. It is not expected to comment
on the validity of the user's data.
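The claim in my response to J1 is easy to check numerically. The sketch
below is a hand-written equal-variance t-statistic, not the scipy.stats
code; the helper name pooled_t is made up for illustration, and the
denominator is the standard error computed from the pooled variance:

```python
import math

def pooled_t(a, b):
    """Equal-variance two-sample t-statistic: (mean(b) - mean(a)) / standard error."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # Sums of squared deviations from each sample mean
    ssa = sum((v - ma) ** 2 for v in a)
    ssb = sum((v - mb) ** 2 for v in b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (mb - ma) / se  # raises ZeroDivisionError when se == 0

# The size of the perturbation y does not matter: t stays at 1
for y in (1e-100, 1.0, 1e100):
    print(y, pooled_t([0, 0, 0, 0], [0, 0, 0, y]))
```

Each line prints a t of 1 up to floating-point rounding. The sign flips
for negative y or if the samples are swapped, but the magnitude is the
same however large or small y is.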
My arguments that 0 is the correct answer to return are:
1) If MD = mean difference and PV = pooled variance, then t = MD/PV. As
MD -> 0 for any PV > 0, t -> 0. However, if we define 0/0 = 1, then
holding MD = 0 and letting PV -> 0 introduces a discontinuity at zero:
t = 0 for any value of PV except 0, where t = 1. In other words, for
MD = PV = 0, t = 1, but if the variance of the dataset is infinitesimally
small but != 0, then t = 0. To me, this makes no sense.
2) The t-test's purpose is to measure the strength of the evidence
against the null hypothesis that MD = 0. If MD = 0, as in the case we're
discussing, by definition there is no evidence that MD != 0. Therefore p
must = 1, and as a consequence t must = 0.
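The discontinuity described in argument 1 can be made concrete with a toy
function. This is only a sketch: t_stat and its zero_over_zero parameter
are hypothetical and do not correspond to anything in the PR.

```python
def t_stat(md, denom, zero_over_zero):
    """t = md / denom, with a caller-chosen value for the 0/0 case (hypothetical helper)."""
    if md == 0 and denom == 0:
        return zero_over_zero
    return md / denom  # still raises ZeroDivisionError for nonzero md with denom == 0

# Hold the mean difference at 0 and shrink the denominator to 0
denoms = [1.0, 1e-50, 1e-300, 0.0]

print([t_stat(0.0, d, zero_over_zero=1.0) for d in denoms])  # [0.0, 0.0, 0.0, 1.0]
print([t_stat(0.0, d, zero_over_zero=0.0) for d in denoms])  # [0.0, 0.0, 0.0, 0.0]
```

With the 0/0 = 1 convention the last value jumps from 0 to 1; with
0/0 = 0 the sequence stays at 0 all the way down.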
I also consulted with a statistics professor - who agreed with my
position - to make sure I wasn't talking out of turn.
In summary, I think that if t = 0/0 we should return 0. However, both R
and Matlab return NaN. To me this seems incorrect due to argument #2.
Josef also mentioned there were users of scipy.stats.stats that didn't
want NaN return values for some reason that I don't know offhand.
How would the Scipy devs like to proceed?