[SciPy-Dev] chi-square test for a contingency (R x C) table

Warren Weckesser warren.weckesser@enthought....
Fri Jun 18 21:02:36 CDT 2010


Bruce Southey wrote:
> On 06/17/2010 07:43 PM, Neil Martinsen-Burrell wrote:
>   
>> On 2010-06-17 11:59, Bruce Southey wrote:
>>     
>>> On 06/17/2010 10:45 AM, josef.pktd@gmail.com wrote:
>>>       
>>>> On Thu, Jun 17, 2010 at 11:31 AM, Bruce Southey <bsouthey@gmail.com>
>>>> wrote:
>>>>
>>>>         
>>>>> On 06/17/2010 09:50 AM, josef.pktd@gmail.com wrote:
>>>>>
>>>>>           
>>>>>> On Thu, Jun 17, 2010 at 10:41 AM, Warren Weckesser
>>>>>> <warren.weckesser@enthought.com> wrote:
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> Bruce Southey wrote:
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>>> On 06/16/2010 11:58 PM, Warren Weckesser wrote:
>>>>>>>>                 
>> [...]
>>
>>     
>>>>>>>> The handling for a one way table is wrong:
>>>>>>>>
>>>>>>>> >>> print 'One way', chisquare_nway([6, 2])
>>>>>>>> (0.0, 1.0, 0, array([ 6.,  2.]))
>>>>>>>>
>>>>>>>> It should also do the marginal independence tests.
>>>>>>>>
>>>>>>>>                 
>>>>>>> As I explained in the description of the ticket and in the 
>>>>>>> docstring,
>>>>>>> this function is not intended for doing the 'one-way' goodness of 
>>>>>>> fit.
>>>>>>> stats.chisquare should be used for that.  Calling chisquare_nway 
>>>>>>> with a
>>>>>>> 1D array amounts to doing a test of independence between 
>>>>>>> groupings but
>>>>>>> only giving a single grouping, hence the trivial result.  This is
>>>>>>> intentional.
>>>>>>>               
>>>>> In expected_nway, you say that "While this function can handle a 1D
>>>>> array," but clearly it does not handle it correctly.
>>>>> If it was your intention not to support one-way tables, then you
>>>>> *must* check the input and reject one-way tables!
>>>>>
>>>>>           
>>>>>>> I guess the question is: should there be a "clever" chi-square 
>>>>>>> function
>>>>>>> that figures out what the user probably wants to do?
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>> My issue is that the chi-squared test statistic is still calculated
>>>>> in exactly the same way for n-way tables where n > 0, so requiring a
>>>>> second function for the one-way table is pure, unnecessary
>>>>> duplication of functionality. I also prefer the one-stop shopping
>>>>> approach.
>>>>>
>>>>>           
>>>> Just because it's chisquare doesn't mean it's the same kind of test.
>>>> This is a test for independence or association that only makes sense
>>>> if there are at least two random variables.
>>>>         
>>> Wrong!
>>> See for example:
>>> http://en.wikipedia.org/wiki/Pearson's_chi-square_test
>>> "Pearson's chi-square is used to assess two types of comparison: tests
>>> of goodness of fit and tests of independence."
>>>
>>> The exact same test statistic is being calculated; only the
>>> hypothesis is different (which is the user's problem, not the
>>> function's problem). So please separate the hypothesis from the test
>>> statistic.
>>>       
>> It is only the exact same test statistic if we know the expected cell 
>> counts.  How these expected cell counts are determined depends 
>> completely on the type of test that is being carried out.  In a 
>> goodness-of-fit test (chisquare_oneway) the proportions of each cell 
>> must be specified in the null hypothesis.  For an independence test 
>> (chisquare_nway), the expected cell counts are computed from the given 
>> data and the null hypothesis of independence.  The fact that the 
>> formula involving observed and expected numbers is the same should not 
>> obscure the fact that the expected numbers come from two completely 
>> different assumptions in the n=1 and n>1 cases.  Can you explain how 
>> the expected cell counts should be determined in the 1D case without 
>> the function making assumptions about the user's null hypothesis?
>>     
> The user's hypothesis is totally irrelevant here because you are 
> computing the TEST STATISTIC (the sum of (observed minus expected) 
> squared, divided by expected), which is available in multiple 
> statistical books and at a not-so-good link:
> http://en.wikipedia.org/wiki/Pearson's_chi-square_test
>
> This is one reason that the current stats.chisquare function accepts 
> observed and expected values as input.
>
> So given a set of observed values and the corresponding set of expected 
> values, you calculate the test statistic and consequently a p-value (you 
> can assume that the test statistic has a chi-squared distribution but 
> nothing is stopping you from some other approach such as bootstrapping). 
> Yet nothing in the calculation of the actual test statistic or of the 
> p-value makes any reference to the user's assumptions.
>
> Where the user's assumptions enter is in how they compute the expected 
> values, as in your expected_nway function, which is NOT your 
> chisquare_nway function. How these expected values are defined does 
> depend on the user's hypothesis, and if the user does not define them, 
> then the function has to guess what the user wants. That guess depends 
> partly on the input (obviously independence is irrelevant for a one-way 
> table) and partly on what the developer wants, such as (marginal) 
> independence for higher-order tables.
>
> Thus it is up to the user to know what assumptions they made and what 
> hypothesis is being tested when they interpret the p-value because scipy 
> is not an expert system that somehow knows what the user wants to get in 
> all cases.
>
>   
>> I believe that we CANNOT separate the test statistic from the user's 
>> null hypothesis and that is the reason that chisquare_oneway and 
>> chisquare_nway should be separate functions.  The information required 
>> to properly do a goodness-of-fit test is qualitatively different from 
>> that required to do an independence test.  I support your suggestion 
>> to reject 1D arrays as input for chisquare_nway.  (With appropriate 
>> checking for arrays such as np.array([[[1, 2, 3, 4]]]).)
>>     
>
> All you need to calculate the Pearson chi-squared is the observed and 
> expected values. I don't care how you obtain these values, but once you 
> do then everything afterwards is the same until the user decides on how 
> to interpret the p-value.
>
> Under your argument there needs to be a new function for every different 
> hypothesis that a user wants. So a 'goodness-of-fit test' must have a 
> separate function (actually probably a good idea) from a one-way test 
> function, etc. But that is pure code bloat when the test statistic is 
> identical.
>
>   
>>>> I don't like mixing shoes and apples.
>>>>
>>>>         
>>> Then please don't.
>>>       
>> Great.  I'm glad to see that we all agree that chisquare_oneway and 
>> chisquare_nway should remain separate functions. :)
>>
>> -Neil
>>     
> Sure, as we need more code duplication to increase the size of scipy and 
> confuse any user!
>   
> Bruce
>   

[Grab a cup of coffee--a long post follows!]

Bruce is correct: there is really just one chi-square statistic 
calculation. To do it, you need the observed frequencies, the expected 
frequencies, and the number of degrees of freedom.  A reason for having 
more than one function is that the default values for the expected 
frequencies, and the formulas for the degrees of freedom, are different 
depending on the hypothesis to be tested.
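
To make this concrete, here is a minimal sketch of the shared core
calculation (the helper name `_chisquare_core` is mine, not from the
ticket, and it assumes the inputs have already been validated):

import numpy as np
from scipy.stats import chi2

def _chisquare_core(obs, expected, dof):
    # Pearson's statistic: the sum of (observed - expected)**2 / expected.
    obs = np.asarray(obs, dtype=float)
    expected = np.asarray(expected, dtype=float)
    stat = ((obs - expected)**2 / expected).sum()
    # p-value from the chi-square survival function with dof degrees
    # of freedom.
    p = chi2.sf(stat, dof)
    return stat, p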

As an exercise, I wrote the function signature and docstring for the 
case of one chi-square function that handles both goodness of fit and 
tests for independence, and then did the same for the API with two 
functions.  I had a small set of use cases in mind.  My candidate APIs 
follow the use cases.

(A fixed font is recommended from here on.)

======================================
Chi-square Test - Elementary Use Cases
======================================

1. (Goodness of fit, with uniform probabilities for the expected frequencies)
A six-sided die is rolled 60 times and the occurrences of each side are
counted.  The following table shows the result:

Side   1   2   3   4   5   6
----------------------------
Freq.  3  17   8  13   7  12

Is the die fair?

In this case, the expected frequencies are uniform:
[10, 10, 10, 10, 10, 10].

The number of degrees of freedom is 5.
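
Worked through with the core sketch above (my helper, not the ticket's
code; the p-value is rounded):

 >>> obs = [3, 17, 8, 13, 7, 12]
 >>> _chisquare_core(obs, [10.0] * 6, dof=5)
 (12.4, 0.0297...)

With p ~ 0.03, fairness of the die looks doubtful at the 5% level.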


2. (Goodness of fit, with given probabilities for the expected frequencies)
An archery target contains three levels: the bull's eye, the inner ring
and the outer ring.  By area, these make up 5%, 25% and 70% of the target,
respectively.

An archer hits the target with 100 arrows, and the number of hits in each
level is recorded:

Level   Bull's eye    Inner   Outer
-----------------------------------
Freq.      16          29      55

Is the archer any better than random?  That is, do the observed
frequencies differ from what area-proportional (random) hits would give?


In this case, the user provides the expected frequencies, as [5, 25, 70]
(or perhaps as a probability distribution [0.05, 0.25, 0.70]).

The number of degrees of freedom is 2.
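
The only new step is converting the given probabilities into expected
frequencies.  Reusing the core sketch (again my helper, not the
ticket's code):

 >>> obs = [16, 29, 55]
 >>> expected = [0.05 * 100, 0.25 * 100, 0.70 * 100]   # [5.0, 25.0, 70.0]
 >>> _chisquare_core(obs, expected, dof=2)   # chi2 ~ 28.05, p ~ 8e-7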


3. (2x2 test of independence)
A group of 54 men and 46 women are asked whether they prefer vanilla or
chocolate ice cream.  The results are:

       Vanilla   Chocolate
Men      23        31
Women    19        27

Is the flavor preference independent of sex?

In a test of independence like this, one typically uses the marginal
sums and the assumption of independence to create the expected frequencies.

The number of degrees of freedom is 1.

Yates' correction should be applied.
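
For reference, here is a sketch of how the expected counts and Yates'
correction work out in this 2x2 case (standard textbook formulas, not
the ticket's code):

import numpy as np

obs = np.array([[23., 31.],
                [19., 27.]])

# Expected counts under independence: outer product of the marginal
# (row and column) sums, divided by the grand total.
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()

# Yates' correction: shrink each |O - E| by 0.5 before squaring.
stat = (((np.abs(obs - expected) - 0.5)**2) / expected).sum()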


4. (3x3x2 test of independence)

No story, just some numbers:

Group 1
    X   Y   Z
A  18  45 113
B  32  60 321
C  74  33  67

Group 2
    X   Y   Z
A  11  33  87
B  19  45 267
C  40  19  80

Are the groupings independent?

The expected frequencies are to be computed from the marginal sums,
assuming that the factors are independent.  The number of degrees of
freedom is 12.
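
Since every independence case comes down to "expected frequencies from
the marginal sums", here is a sketch of that computation for an
arbitrary n-way table (the name `expected_freq` and the implementation
are mine; it assumes a NumPy whose sum() accepts a tuple of axes):

import numpy as np
from functools import reduce

def expected_freq(obs):
    obs = np.asarray(obs, dtype=float)
    # Marginal sums: for each axis k, sum over all of the *other* axes.
    margins = [obs.sum(axis=tuple(ax for ax in range(obs.ndim) if ax != k))
               for k in range(obs.ndim)]
    # Expected counts: the outer product of the marginals, scaled by
    # total**(ndim - 1).
    return reduce(np.multiply.outer, margins) / obs.sum()**(obs.ndim - 1)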


====================================
API: One or two (or more) functions?
====================================

Bruce has suggested implementing a "one-stop shopping" function.
Here's my take on what such a function would look like:

Single Function
---------------

def chisquare(obs, expected=None, correction=False):
    """Chi-square all-in-one function.
  
    Parameters
    ----------
    obs : array_like
        The observed frequencies

    expected : array_like, with the same shape as obs (optional)
        The expected frequencies.  If `expected` is given, then no matter
        what the dimension of `expected`, the number of degrees of freedom
        is equal to one less than the number of elements in `expected`.
        (This amounts to a "goodness of fit" test.)
       
        If `expected` is not given and `obs` is 1D, the function
        assumes that the expected frequencies are uniformly distributed.
        The number of degrees of freedom is one less than the number of
        elements in `obs`.  (This is a test of "goodness of fit" to a
        uniform distribution.)

        If `expected` is not given and `obs` is 2D or higher dimensional,
        the function will compute the expected frequencies from the
        marginal sums of `obs`, assuming independence of the factors.
        The number of degrees of freedom is the number of elements in
        `obs`, minus the sum of the dimensions, plus the number of
        dimensions, minus 1.  (This is the "test of independence".)

    correction : bool (optional)
        If True and the number of degrees of freedom is one, Yates'
        correction for continuity is applied.

    Returns
    -------
    chi2 : float
        The Chi-square statistic
    p : float
        The p-value
    dof : int
        The degrees of freedom
    expected : ndarray (same shape as `obs`)
        The expected frequencies
    """

Note that this function handles the case of a given `expected` array
that is 2D or higher dimensional.  It simply ignores the dimensionality
in this case and treats the problem the same as the 1D case.

I haven't included a `ddof` argument. I am following the YAGNI
principle:  all the use cases that I need are covered without `ddof`.
But that may reflect my inexperience with chi-square tests.

I prefer to not have the return type depend on the arguments, so it
always returns `expected`, even if `expected` was given as an
argument.
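
To check that the conditional role of `expected` is implementable, here
is an untested sketch of the dispatch, reusing the `expected_freq`
helper sketched earlier (the branching is the point, not the code):

import numpy as np
from scipy.stats import chi2

def chisquare(obs, expected=None, correction=False):
    obs = np.asarray(obs, dtype=float)
    if expected is not None:
        expected = np.asarray(expected, dtype=float)
        dof = expected.size - 1                      # goodness of fit
    elif obs.ndim == 1:
        expected = np.full(obs.shape, obs.sum() / obs.size)
        dof = obs.size - 1                           # fit to uniform
    else:
        expected = expected_freq(obs)                # independence
        dof = obs.size - sum(obs.shape) + obs.ndim - 1
    diff = np.abs(obs - expected)
    if correction and dof == 1:
        diff = diff - 0.5                            # Yates' correction
    stat = (diff**2 / expected).sum()
    return stat, chi2.sf(stat, dof), dof, expected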


Two Functions
-------------

An alternative is to provide two functions.  One function is intended
for goodness-of-fit tests, which really just means that (a) the number
of degrees of freedom is one less than the number of observed
frequencies, and (b) if the expected frequencies are not given, they
are assumed to be uniform.  The other function is for testing
independence of factors.  The test only makes sense for arrays of
dimension 2 or higher.  The 1D case can be handled consistently,
but it is trivial.  This function is what I called chisquare_nway
in ticket #1203; here it is called chisquare_ind, to better reflect
its purpose.

Note that in the test for independence, an `expected` array is not given;
it is always computed by the function.


def chisquare_fit(obs, expected=None, correction=False):
    """Chi-square calculation for goodness-of-fit test.

    The number of degrees of freedom is one less than the number of
    elements in `obs`.

    Parameters
    ----------
    obs : array_like
        The observed frequencies

    expected : array_like, with the same shape as obs (optional)
        The expected frequencies. If `expected` is not given, the function
        assumes that the expected frequencies are uniformly distributed.
        That is,
            expected = float(obs.sum()) * ones_like(obs) / obs.size

    correction : bool (optional)
        If True and the number of degrees of freedom is one, Yates'
        correction for continuity is applied.

    Returns
    -------
    chi2 : float
        The Chi-square statistic
    p : float
        The p-value
    dof : int
        The degrees of freedom; this is obs.size - 1
    """


def chisquare_ind(obs, correction=False):
    """Chi-square calculation for test of independence.

    Each dimension of `obs` corresponds to a factor.
    The function will compute the expected frequencies based on the
    marginal sums of `obs` and assuming independence of the factors.
    The number of degrees of freedom is the number of elements in `obs`,
    minus the sum of the dimensions, plus the number of dimensions, minus 1.
  
    Parameters
    ----------
    obs : array_like
        The observed frequencies.  Each dimension of `obs` corresponds to
        a factor.  Note that if `obs` is 1D, this test is trivial: the
        chi-square statistic will be 0, the p-value will be 1, and the
        expected frequencies will be the same as `obs`.

    correction : bool (optional)
        If True and the number of degrees of freedom is one, Yates'
        correction for continuity is applied.

    Returns
    -------
    chi2 : float
        The Chi-square statistic
    p : float
        The p-value
    dof : int
        The degrees of freedom
    expected : ndarray (same shape as `obs`)
        The expected frequencies
    """

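In code, the degrees-of-freedom rule in this docstring is a one-liner:

    dof = obs.size - sum(obs.shape) + obs.ndim - 1

For the 3x3x2 table of use case 4, that gives 18 - 8 + 3 - 1 = 12.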

Each separate function has a somewhat simpler API compared to the
single function.  I felt I had to write more about `expected` in the
single function, because its role is conditional--it depends on its
shape.  So, following the principle that "Simple is better than complex",
I have a slight preference for the two separate functions.  However,
now that I've written the docstring, I don't think the single function
is terrible.

When applied to the use cases, there is only a trivial difference in
the APIs.  For example, using the single function:

1.
 >>> obs = [3, 17, 8, 13, 7, 12]
 >>> chisquare(obs)

2.
 >>> obs = [16, 29, 55]
 >>> expected  = [5, 25, 70]
 >>> chisquare(obs, expected)

3.
 >>> obs = [[23, 31], [19, 27]]
 >>> chisquare(obs, correction=True)

4.
 >>> # The table from use case 4, with the groups along the first axis.
 >>> obs = [[[18, 45, 113], [32, 60, 321], [74, 33, 67]],
 ...        [[11, 33, 87], [19, 45, 267], [40, 19, 80]]]
 >>> chisquare(obs)


To use the two functions instead, just use chisquare_fit in the first
two cases and chisquare_ind in the last two, with exactly the same
arguments.
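
For example, with the same data:

 >>> chisquare_fit([3, 17, 8, 13, 7, 12])
 >>> chisquare_fit([16, 29, 55], [5, 25, 70])
 >>> chisquare_ind([[23, 31], [19, 27]], correction=True)
 >>> chisquare_ind(obs)    # obs is the 3x3x2 table from use case 4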


Anyone still here?  Still awake?  Thoughts?  Preferences?


Warren


