[SciPy-dev] changes to stats t-tests
Sun Dec 21 09:56:12 CST 2008
On Sun, Dec 21, 2008 at 3:03 AM, Jarrod Millman <email@example.com> wrote:
> On Sat, Dec 20, 2008 at 11:33 AM, <firstname.lastname@example.org> wrote:
>> I finally looked at the t-test in more detail and I would like to make
>> three changes to the ttests
> Thanks for looking into this.
>> 1) ttest_1samp: add axis argument, it is the only ttest and similar
>> function without an axis argument. Currently the retrun for multi
>> dimensional arrays is wrong, the array is flattened for the
>> calculation of mean and variance, but the number of observations is
>> taken along axis 0 (Test suite only check for 1-dim case).
>> One problem is the default axis, the usual default axis in statistics,
>> scipy.stats is zero, but it is not always consistently used
>> ttest_rel(a,b,axis=None), ttest_ind(a, b, axis=0), sem(a, axis=0)
> +1 to adding axis arg to ttest_1samp. I would set the default axis to
> 0 in all cases (i.e., I would also like to see ttest_rel default to
>> 2) return for t statistic: return number instead of 0-dimensional
>> array, no changes for higher dimensions
>> Same for all three tests.
>> current return (t-statistic, p-value)
>> (array(0.81248591389165681), 0.41846234511362179)
>> proposed return
>> (0.81248591389165681, 0.41846234511362179)
>> 3) clean up handling of zero division problem: current zero division
>> problem doesn't make sense, it return a t-statistic arbitrarily set
>> to one. This only applies to cases with zero variance
>> proposed change, return either inf (if numerator different from
>> zero) or zero (if numerator is also zero.
> I would think that for a/0=x:
> a > 0 then x is +inf
> a < 0 then x is -inf
> a = 0 then x is NaN
> Why would x be 0 when a is 0? Regardless, I agree that the current
> behavior is incorrect and needs to be fixed.
The +inf, -inf I have implemented this way.
The case when a=0 is a/0=x is not so obvious
a = 0 and the denominator =0, means that the variance is zero and the
null hypothesis is exactly satisfied, e.g. in the independent case,
this means we are comparing two identical constant arrays. So, I
thought the t-statistic should be zero and the pvalue is one, meaning
the probability that the two random variables have the same mean is
that's what I currently have:
>>> ttest_ind([0,0,0], [0,0,0])
>>> ttest_ind([0,0,0], [1,1,1])
>>> ttest_1samp([0,0,0], 0)
>>> ttest_1samp([0,0,0], 1)
>>> ttest_1samp([0,0,0], -1)
However after looking at the limiting case for a=0 some more, it is
not clear to me anymore whether my initial intuition is correct. I
tried it out for a normal distribution with variance going to zero,
but I don't have time to find what the correct theoretical limit for
the t-statistic is for a degenerate normal distribution. But it seems
that it might be one, as defined in the initial zero division problem.
This appears correct to me, since the t-statistic is normalized in
such a way that it is prevented from collapsing to a degenerate random
The rejection rates for the ttest_1samp with a normal distribution
with scale 1e-8, are very close to the theoretical values. The power
to reject incorrect null hypothesis is huge. So, for any
non-degenerate case the ttest functions work correctly.
Now I think that defining t=0/0=1 might be more appropriate. For both
cases t=0/0=1 or t=0/0=0, we would not reject the null hypothesis at
any of the usual confidence levels, which is the correct answer. I
wouldn't set 0/0 = nan in this case, because it may show up in some
edge cases, and not rejecting the null is the correct answer.
t=0/0=1 is also the numerical limiting case for simple examples with
small variance (I hadn't checked this before), from current version
>>> st.ttest_rel([1e-4,0,0,0,0,0,0,0,0], [0,0,0,0,0,0,0,0,0])
>>> st.ttest_ind([1e-4,0,0,0,0,0,0,0,0], [0,0,0,0,0,0,0,0,0])
>>> st.ttest_1samp([1e-4,0,0,0,0,0,0,0,0], 0)
>>> st.ttest_ind([1e-4,0,0,0,0,0,0,0,0], [1e-4,0,0,0,0,0,0,0,0])
>> Since, we are just before release, I don't know if I can make these
>> changes now or after the release. Most of it fixes incorrect returns,
>> and I'm writing the tests for all cases.
> I would like to see these changes make it into the 0.7 release.
> Especially if you write tests and document the changes. Can you make
> the changes before the end of the weekend?
I should be able to commit my rewrite in a few hours, I will use
default axis zero and t=0/0=1, if there are no objections.
More information about the Scipy-dev