[SciPy-Dev] chi-square test for a contingency (R x C) table

Bruce Southey bsouthey@gmail....
Thu Jun 3 10:05:45 CDT 2010

```
On 06/03/2010 01:48 AM, josef.pktd@gmail.com wrote:
> On Wed, Jun 2, 2010 at 4:03 PM, Bruce Southey<bsouthey@gmail.com>  wrote:
>
>> On 06/02/2010 01:41 PM, josef.pktd@gmail.com wrote:
>>
>> On Wed, Jun 2, 2010 at 2:18 PM, Neil Martinsen-Burrell<nmb@wartburg.edu>
>> wrote:
>>
>>
>> On 2010-06-02 13:10 , Bruce Southey wrote:
>> [...]
>>
>>
>>
>> However, this code is the chi-squared test part as SAS will compute the
>> actual cell numbers. Also an extension to scipy.stats.chisquare() so we
>> can not have both functions.
>>
>>
>> Again, I don't understand what you mean that we can't have both
>> functions? I believe (from a statistics teacher's point of view) that
>> the Chi-Squared goodness of fit test (which is stats.chisquare) is a
>> different beast from the Chi-Square test for independence (which is
>> stats.chisquare_contingency). The fact that the distribution of the
>> test statistic is the same should not tempt us to put them into the
>> same function.
>>
>>
>> the 1-d case of yours.
>> Quote from the docstring:
>> " The chi square test tests the null hypothesis that the categorical data
>> has the given frequencies."
>> Also go to the web site provided in the docstring.
>>
>> By default you get the expected frequencies but you can also put in your
>> own using the f_exp variable. You could do the same in your code.
>>
>>
>> In fact, Warren correctly used stats.chisquare with the expected
>> frequencies calculated from the null hypothesis and the corrected
>> degrees of freedom.  chisquare_contingency is in some sense a
>> convenience method for taking care of these pre-calculations before
>> calling stats.chisquare.  Can you explain more clearly to me why we
>> should not include such a convenience function?
>>
>>
>> Just a clarification, before I find time to work my way through the
>>
>> stats.chisquare is a generic test for goodness-of-fit for discrete or
>> binned distributions.
>> and from the docstring of it
>> "If no expected frequencies are given, the total
>>      N is assumed to be equally distributed across all groups."
>>
>> default is uniform distribution
>>
>>
>>
>> Try:
>> http://en.wikipedia.org/wiki/Pearson's_chi-square_test
>>
>> The use of the uniform distribution is rather misleading and technically
>> wrong as it does not help address the expected number of outcomes in a cell:
>>
> "A simple example is the hypothesis that an ordinary six-sided dice is
> "fair", i.e., all six outcomes are equally likely to occur."
>
> I don't see anything misleading or technically wrong with the uniform
> distributions,
> or if they come from a Poisson, Hypergeometric, binned Normal or any
> number of other distributions.
>
Okay this must be only for the 1-way table as it does not apply to the
2-way or higher tables where the test is for independence between
variables.

There are valid technical reasons why it is misleading: saying that a
random variable comes from some distribution has a specific, fixed
meaning. If a random variable comes from the discrete uniform
distribution on 1..N, then it must also have mean (N+1)/2,
variance (N+1)*(N-1)/12, etc. The null hypothesis says nothing about the
moments of the random variable, so you cannot say which distribution
the random variable is from. For example, the random variable could be
from a beta-binomial distribution (which is the discrete uniform when
alpha=beta=1) or a binomial/multinomial with equal probabilities, and the
statement 'all [the] outcomes are equally likely to occur' would remain
true.

If you assume that your random variables are discrete uniform or any
other distribution (except normal), then in general you cannot assume
that Pearson's chi-squared test statistic has a specific
distribution. In this case, however, Pearson's chi-squared test
statistic is asymptotically chi-squared because of the normality
assumption. So provided the central limit theorem applies (not
necessarily true for all distributions or for 'small' sample sizes),
this test will be asymptotically valid regardless of the distribution
assumed for the random variables.

>> http://en.wikipedia.org/wiki/Discrete_uniform_distribution
>>
>>
>> chisquare_twoway is a special case that additionally calculates the
>> correct expected frequencies for the test of independence based on the
>> margin totals. The resulting distribution is not uniform.
>>
>>
>> Actually the null hypothesis is rather different between 1-way and 2-way
>> tables so you can not say that chisquare_twoway is a special case of
>> chisquare.
>>
> What is the Null hypothesis in a one-way table?
>
> Josef
>
>
SAS definition for 1-way table: "the null hypothesis specifies equal
proportions of the total sample size for each class". This is not the
same as specifying a discrete uniform distribution, as you are not
directly testing that the cells have equal probability. But the ultimate
outcome is probably not any different.

Bruce

>> I am not sure what you mean by the 'resulting distribution is not uniform'.
>> The distribution of the cells values has nothing to do with the uniform
>> distribution in either case because it is not used in the data nor in the
>> formulation of the test. (And, yes, I have had to do the proof that the test
>> statistic is Chi-squared - which is why there is the warning about small
>> cells...).
>>
>> I agree with Neil that this is a very useful convenience function.
>>
>>
>> My problem with the chisquare_twoway is that it should not call another
>> function to finish two lines of code. It is just an excessive waste of
>> resources.
>>
>> I never heard of a one-way contingency table, my question was whether
>> the function should also handle 3-way or 4-way tables, in addition to
>> two-way.
>>
>>
>> Correct to both of these as I just consider these as n-way tables. I think
>> that contingency tables by definition only apply to the 2-d case. Pivot
>> tables are essentially the same thing. I would have to look up how to get
>> the expected number of outcomes but probably of the form Ni.. * N.j.
>> * N..k / N...^2 for the 3-way (the 2-way table is of the form Ni.*N.j/N..) for
>> i=rows, j=columns, k=3rd axis and '.' means sum for that axis.
>>
>> I thought about the question how the input should be specified for my
>> initial response, the alternative would be to use the original data or
>> a "long" format instead of a table. But I thought that as a
>> convenience function using the table format will be the most common
>> use.
>>
>> I have written functions in the past that calculate the contingency
>> table, and it would be very useful to have a more complete coverage of
>> tools to work with contingency tables in scipy.stats (or temporarily
>> in statsmodels, where we are also working on the anova type of
>> analysis)
>>
>>
>> It depends on what tasks are needed.  Really there are two steps:
>> 1) Cross-tabulation that summarizes the data from whatever input (groupby
>> would help here).
>> 2) Statistical tests - series of functions that accept summarized data only.
>>
>> If you have separate functions then the burden is on the user to find and
>> call all the desired functions. You can also provide a single helper
>> function to do all that because you don't want to repeat unnecessary calls.
>>
>> So, I think the way it is, it is a nice function and we don't have to
>> put all contingency table analysis into this function.
>>
>> Josef
>>
>>
>> Bruce
>>
>>
>>
>>
>>
>> Really this should be combined with fisher.py in ticket 956:
>> http://projects.scipy.org/scipy/ticket/956
>>
>>
>> Wow, apparently I have lots of disagreements today, but I don't think
>> that this should be combined with Fisher's Exact test. (I would like
>> to see that ticket mature to the point where it can be added to
>> scipy.stats.) I like the functions in scipy.stats to correspond in a
>> one-to-one manner with the statistical tests. I think that the docs
>> but I think that one function/one test is a good rule. This is
>> particularly true for people (like me) who would like to someday be
>> able to use scipy.stats in a pedagogical context.
>>
>> -Neil
>>
>>
>> I don't see any 'disagreements' rather just different ways to do things
>> and identifying areas that need to be addressed for more general use.
>>
>>
>> Agreed. :)
>>
>> [...]
>>
>> -Neil
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev

```
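For readers following the 1-way versus 2-way argument, here is a minimal sketch of the arithmetic being discussed, written in plain Python so that each step is visible. It is not the proposed chisquare_twoway or chisquare_contingency code from the thread; the function names (`expected_oneway`, `expected_twoway`, `pearson_statistic`) and the sample counts are hypothetical, chosen only for illustration. The 1-way expected counts follow the "equal proportions" null hypothesis quoted from SAS, and the 2-way expected counts follow the Ni.*N.j/N.. form given above.

```python
def expected_oneway(counts):
    """Expected counts when every class gets an equal share of the total."""
    total = sum(counts)
    return [total / len(counts)] * len(counts)


def expected_twoway(table):
    """Expected counts under independence: E[i][j] = N_i. * N_.j / N_.."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def pearson_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells of two same-shaped tables."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# 1-way example (made-up die rolls): null hypothesis is equal proportions.
die_counts = [5, 8, 9, 8, 10, 20]
die_expected = expected_oneway(die_counts)
die_stat = pearson_statistic([die_counts], [die_expected])
die_dof = len(die_counts) - 1

# 2-way example (made-up R x C table): null hypothesis is independence.
observed = [[10, 20], [30, 40]]
expected = expected_twoway(observed)
stat = pearson_statistic(observed, expected)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(die_stat, die_dof)
print(expected, stat, dof)
```

The statistic has the same form in both cases; only the expected counts and the degrees of freedom differ, which is the point Neil and Josef make about a convenience wrapper. With SciPy, the 2-way result should correspond to calling stats.chisquare on the flattened table with f_exp set to these expected counts and ddof set to R + C - 2, so that the degrees of freedom come out as (R-1)*(C-1).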