[SciPy-dev] changes to stats t-tests
josef.pktd@gmai...
josef.pktd@gmai...
Sun Dec 21 09:56:12 CST 2008
On Sun, Dec 21, 2008 at 3:03 AM, Jarrod Millman <millman@berkeley.edu> wrote:
> On Sat, Dec 20, 2008 at 11:33 AM, <josef.pktd@gmail.com> wrote:
>> I finally looked at the t-test in more detail and I would like to make
>> three changes to the ttests
>
> Thanks for looking into this.
>
>> 1) ttest_1samp: add axis argument, it is the only ttest and similar
>> function without an axis argument. Currently the retrun for multi
>> dimensional arrays is wrong, the array is flattened for the
>> calculation of mean and variance, but the number of observations is
>> taken along axis 0 (Test suite only check for 1-dim case).
>> One problem is the default axis, the usual default axis in statistics,
>> scipy.stats is zero, but it is not always consistently used
>> ttest_rel(a,b,axis=None), ttest_ind(a, b, axis=0), sem(a, axis=0)
>
> +1 to adding axis arg to ttest_1samp. I would set the default axis to
> 0 in all cases (i.e., I would also like to see ttest_rel default to
> axis=0).
>> 2) return for t statistic: return number instead of 0-dimensional
>> array, no changes for higher dimensions
>> Same for all three tests.
>>
>> current return (t-statistic, p-value)
>> (array(0.81248591389165681), 0.41846234511362179)
>> proposed return
>> (0.81248591389165681, 0.41846234511362179)
>
> +1
>
>> 3) clean up handling of zero division problem: current zero division
>> problem doesn't make sense, it return a t-statistic arbitrarily set
>> to one. This only applies to cases with zero variance
>> proposed change, return either inf (if numerator different from
>> zero) or zero (if numerator is also zero.
>
> I would think that for a/0=x:
> a > 0 then x is +inf
> a < 0 then x is -inf
> a = 0 then x is NaN
>
> Why would x be 0 when a is 0? Regardless, I agree that the current
> behavior is incorrect and needs to be fixed.
The +inf, -inf I have implemented this way.
The case when a=0 is a/0=x is not so obvious
a = 0 and the denominator =0, means that the variance is zero and the
null hypothesis is exactly satisfied, e.g. in the independent case,
this means we are comparing two identical constant arrays. So, I
thought the t-statistic should be zero and the pvalue is one, meaning
the probability that the two random variables have the same mean is
one.
that's what I currently have:
>>> ttest_ind([0,0,0], [0,0,0])
(0.0, 1.0)
>>> ttest_ind([0,0,0], [1,1,1])
(-1.#INF, 0.0)
>>> ttest_1samp([0,0,0], 0)
(0.0, 1.0)
>>> ttest_1samp([0,0,0], 1)
(-1.#INF, 0.0)
>>> ttest_1samp([0,0,0], -1)
(1.#INF, 0.0)
However after looking at the limiting case for a=0 some more, it is
not clear to me anymore whether my initial intuition is correct. I
tried it out for a normal distribution with variance going to zero,
but I don't have time to find what the correct theoretical limit for
the t-statistic is for a degenerate normal distribution. But it seems
that it might be one, as defined in the initial zero division problem.
This appears correct to me, since the t-statistic is normalized in
such a way that it is prevented from collapsing to a degenerate random
variable.
The rejection rates for the ttest_1samp with a normal distribution
with scale 1e-8, are very close to the theoretical values. The power
to reject incorrect null hypothesis is huge. So, for any
non-degenerate case the ttest functions work correctly.
Now I think that defining t=0/0=1 might be more appropriate. For both
cases t=0/0=1 or t=0/0=0, we would not reject the null hypothesis at
any of the usual confidence levels, which is the correct answer. I
wouldn't set 0/0 = nan in this case, because it may show up in some
edge cases, and not rejecting the null is the correct answer.
t=0/0=1 is also the numerical limiting case for simple examples with
small variance (I hadn't checked this before), from current version
>>> st.ttest_rel([1e-4,0,0,0,0,0,0,0,0], [0,0,0,0,0,0,0,0,0])
(array(0.99999999999999967), 0.34659350708733427)
>>> st.ttest_ind([1e-4,0,0,0,0,0,0,0,0], [0,0,0,0,0,0,0,0,0])
(array(0.99999999999999967), 0.33219498465297947)
>>> st.ttest_1samp([1e-4,0,0,0,0,0,0,0,0], 0)
(0.99999999999999967, 0.34659350708733427)
>>> st.ttest_ind([1e-4,0,0,0,0,0,0,0,0], [1e-4,0,0,0,0,0,0,0,0])
(array(0.0), 1.0)
>
>> Since, we are just before release, I don't know if I can make these
>> changes now or after the release. Most of it fixes incorrect returns,
>> and I'm writing the tests for all cases.
>
> I would like to see these changes make it into the 0.7 release.
> Especially if you write tests and document the changes. Can you make
> the changes before the end of the weekend?
>
I should be able to commit my rewrite in a few hours, I will use
default axis zero and t=0/0=1, if there are no objections.
Josef
More information about the Scipy-dev
mailing list