[SciPy-Dev] chi-square test for a contingency (R x C) table
Wed Jun 2 15:03:07 CDT 2010
On 06/02/2010 01:41 PM, firstname.lastname@example.org wrote:
> On Wed, Jun 2, 2010 at 2:18 PM, Neil Martinsen-Burrell <email@example.com> wrote:
>> On 2010-06-02 13:10 , Bruce Southey wrote:
>>>>> However, this code is the chi-squared test part as SAS will compute the
>>>>> actual cell numbers. Also, it is an extension of scipy.stats.chisquare(),
>>>>> so we cannot have both functions.
>>>> Again, I don't understand what you mean when you say that we can't have
>>>> both functions. I believe (from a statistics teacher's point of view) that
>>>> the Chi-Squared goodness of fit test (which is stats.chisquare) is a
>>>> different beast from the Chi-Square test for independence (which is
>>>> stats.chisquare_contingency). The fact that the distribution of the
>>>> test statistic is the same should not tempt us to put them into the
>>>> same function.
>>> Please read scipy.stats.chisquare() because scipy.stats.chisquare() is
>>> the 1-d case of yours.
>>> Quote from the docstring:
>>> " The chi square test tests the null hypothesis that the categorical data
>>> has the given frequencies."
>>> Also go to the web site provided in the docstring.
>>> By default you get the expected frequencies but you can also put in your
>>> own using the f_exp variable. You could do the same in your code.
>> In fact, Warren correctly used stats.chisquare with the expected
>> frequencies calculated from the null hypothesis and the corrected
>> degrees of freedom. chisquare_contingency is in some sense a
>> convenience method for taking care of these pre-calculations before
>> calling stats.chisquare. Can you explain more clearly to me why we
>> should not include such a convenience function?
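The pre-calculations described above can be sketched roughly as follows. This is only an illustration, not the code from the patch under discussion: the name chisquare_contingency_sketch and the example table are made up here. It assumes a 2-d table of counts, computes the expected frequencies from the margin totals, and passes them to stats.chisquare with ddof chosen so that the degrees of freedom come out as (R-1)*(C-1).

```python
import numpy as np
from scipy import stats


def chisquare_contingency_sketch(table):
    """Hypothetical helper: chi-square test of independence for an R x C table."""
    observed = np.asarray(table, dtype=float)
    n_rows, n_cols = observed.shape

    # Expected counts under independence: (row total * column total) / N.
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / observed.sum()

    # stats.chisquare uses dof = k - 1 - ddof, with k = R*C cells here.
    # The independence test needs dof = (R-1)*(C-1), hence ddof = R + C - 2.
    ddof = n_rows + n_cols - 2
    chi2, p = stats.chisquare(observed.ravel(), f_exp=expected.ravel(), ddof=ddof)
    return chi2, p, expected


# Made-up counts, for illustration only.
table = [[10, 20, 30],
         [20, 20, 20]]
chi2, p, expected = chisquare_contingency_sketch(table)
print(chi2, p)
print(expected)
```

The table is flattened before the call so that ddof refers to the total number of cells rather than to one axis.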
> Just a clarification, before I find time to work my way through the
> other comments
> stats.chisquare is a generic test for goodness-of-fit for discrete or
> binned distributions.
> and from its docstring:
> "If no expected frequencies are given, the total
> N is assumed to be equally distributed across all groups."
> default is uniform distribution
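For reference, the default behaviour quoted from the docstring can be written out by hand. The sketch below is plain Python and is not the scipy implementation; the function name chisquare_gof and the counts are chosen here for illustration. When no expected frequencies are given, each of the k groups gets N/k.

```python
def chisquare_gof(observed, expected=None):
    """Hypothetical one-way goodness-of-fit statistic (not the scipy code)."""
    n = sum(observed)
    k = len(observed)
    if expected is None:
        # Default described in the docstring: N spread equally over all groups.
        expected = [n / k] * k
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    dof = k - 1
    return stat, dof


# Made-up counts for six groups; N = 88, so each expected count is 88/6.
observed = [16, 18, 16, 14, 12, 12]
stat, dof = chisquare_gof(observed)
print(stat, dof)  # statistic is approximately 2.0 with 5 degrees of freedom
```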
The use of the uniform distribution is rather misleading and technically
wrong as it does not help address the expected number of outcomes in a cell:
> chisquare_twoway is a special case that additionally calculates the
> correct expected frequencies for the test of independence based on the
> margin totals. The resulting distribution is not uniform.
Actually the null hypothesis is rather different between 1-way and 2-way
tables so you can not say that chisquare_twoway is a special case of
I am not sure what you mean by the 'resulting distribution is not
uniform'. The distribution of the cell values has nothing to do with
the uniform distribution in either case because it is used neither in
the data nor in the formulation of the test. (And, yes, I have had to do the
proof that the test statistic is Chi-squared - which is why there is the
warning about small cells...).
> I agree with Neil that this is a very useful convenience function.
My problem with the chisquare_twoway is that it should not call another
function to finish two lines of code. It is just an excessive waste of
> I never heard of a one-way contingency table, my question was whether
> the function should also handle 3-way or 4-way tables, additional to
Correct to both of these as I just consider these as n-way tables. I
think that contingency tables by definition only apply to the 2-d
case. Pivot tables are essentially the same thing. I would have to
look up how to get the expected number of outcomes but probably of the
form Ni.. * N.j. * N..k / N...^2 for the 3-way (the 2-way table is of the
form Ni. * N.j / N..) for i=rows, j=columns, k=3rd axis, where '.' means
the sum over that axis.
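As a rough check of these formulas, here is a small numpy sketch (the function name expected_under_independence is made up for illustration). It assumes the null hypothesis is mutual independence of all axes, in which case the product of the one-way marginal totals is divided by N^(n-1) for an n-way table, i.e. by N for 2-way and by N^2 for 3-way.

```python
from functools import reduce

import numpy as np


def expected_under_independence(observed):
    """Hypothetical helper: expected counts for an n-way table (n >= 2),
    assuming mutual independence of all axes."""
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    ndim = observed.ndim

    marginals = []
    for axis in range(ndim):
        # Sum over every other axis; keepdims lets the marginal broadcast
        # back along its own axis when the marginals are multiplied.
        other_axes = tuple(a for a in range(ndim) if a != axis)
        marginals.append(observed.sum(axis=other_axes, keepdims=True))

    return reduce(np.multiply, marginals) / n ** (ndim - 1)


# 2-way: reduces to Ni. * N.j / N..
e2 = expected_under_independence([[10, 20, 30], [20, 20, 20]])
print(e2)

# 3-way: Ni.. * N.j. * N..k / N...^2 (made-up counts 1..24).
obs3 = np.arange(1, 25).reshape(2, 3, 4)
e3 = expected_under_independence(obs3)
print(e3.sum(), obs3.sum())  # the expected counts add up to N
```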
> I thought about how the input should be specified for my initial
> response; the alternative would be to use the original data or
> a "long" format instead of a table. But I thought that as a
> convenience function using the table format will be the most common
> I have written functions in the past that calculate the contingency
> table, and it would be very useful to have a more complete coverage of
> tools to work with contingency tables in scipy.stats (or temporarily
> in statsmodels, where we are working also on the anova type of
It depends on what tasks are needed. Really there are two steps:
1) Cross-tabulation that summarizes the data from whatever input
(groupby would help here).
2) Statistical tests - series of functions that accept summarized data only.
If you have separate functions then the burden is on the user to find
and call all the desired functions. You can also provide a single helper
function to do all that because you don't want to repeat unnecessary calls.
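To make the two steps concrete, here is a plain-Python sketch. All names (crosstab, chi2_statistic, analyze) and the data are hypothetical and not an existing scipy API. Step 1 builds the table from "long" format data, step 2 accepts only the summarized table, and a single helper chains the two.

```python
from collections import Counter


def crosstab(rows, cols):
    """Step 1: summarize paired observations into a table of counts."""
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    counts = Counter(zip(rows, cols))
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table


def chi2_statistic(table):
    """Step 2: chi-square statistic and dof from summarized data only."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            stat += (obs - exp) ** 2 / exp
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def analyze(rows, cols):
    """Single helper: cross-tabulate once, then run the test on the table."""
    _, _, table = crosstab(rows, cols)
    return chi2_statistic(table)


# Made-up long-format data: one (group, outcome) pair per observation.
group = ["a", "a", "a", "b", "b", "b", "a", "b"]
outcome = ["yes", "no", "yes", "no", "no", "yes", "yes", "no"]
_, _, table = crosstab(group, outcome)
stat, dof = analyze(group, outcome)
print(table, stat, dof)
```

Keeping step 2 free of any raw-data handling means further tests on the same table could reuse the output of step 1 without tabulating again.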
> So, I think the way it is, it is a nice function and we don't have to
> put all contingency table analysis into this function.
>>>>> Really this should be combined with fisher.py in ticket 956:
>>>> Wow, apparently I have lots of disagreements today, but I don't think
>>>> that this should be combined with Fisher's Exact test. (I would like
>>>> to see that ticket mature to the point where it can be added to
>>>> scipy.stats.) I like the functions in scipy.stats to correspond in a
>>>> one-to-one manner with the statistical tests. I think that the docs
>>>> should "See Also" the appropriate exact (and non-parametric) tests,
>>>> but I think that one function/one test is a good rule. This is
>>>> particularly true for people (like me) who would like to someday be
>>>> able to use scipy.stats in a pedagogical context.
>>> I don't see any 'disagreements' rather just different ways to do things
>>> and identifying areas that need to be addressed for more general use.
>> Agreed. :)