[Scipy-tickets] [SciPy] #1203: stats: chi-square test of independence

SciPy Trac scipy-tickets@scipy....
Thu Jun 17 00:06:10 CDT 2010


#1203: stats: chi-square test of independence
----------------------------------------------+-----------------------------
 Reporter:  warren.weckesser                  |       Owner:  somebody
     Type:  enhancement                       |      Status:  new     
 Priority:  normal                            |   Milestone:  0.8.0   
Component:  scipy.stats                       |     Version:  0.7.0   
 Keywords:  chi-square chisquare chi-squared  |  
----------------------------------------------+-----------------------------

Old description:

> The attached file, chisquare_nway.py, includes the function
> chisquare_nway() that computes a chi-square test of independence for an
> n-dimensional array.  I think this would be a nice enhancement for
> scipy.stats.  A discussion on this topic (and an early, limited version
> of the function) can be found here:
>
>     http://mail.scipy.org/pipermail/scipy-dev/2010-June/014538.html
>
> Inspired by that discussion, I generalized the function, added the
> optional Yates' correction for continuity, and figured out how to do the
> equivalent calculation in R for comparison.
>
> For now, I have attached just a standalone Python file.  After getting
> some feedback (especially about the API), I'll create a patch containing
> the code, tests, and updates to the module docs and release notes.
>
> Some additional comments about the code:
>
> I have included the degrees of freedom and the table of expected
> frequencies in the output.  This is convenient for comparing to R, and
> they are just handy to have available.
>
> I implemented the Yates correction for continuity, but it is only
> allowed when the degrees of freedom is 1.  Everything I have read
> seems to suggest that the correction is for this case only, but I
> have not dug very deeply. In particular, I haven't looked up the
> original reference.
>
> As far as I can tell, R's chisq.test does not handle three-way
> or higher tests.  chisq.test does a one-way test of goodness of
> fit, or a two-way test of independence.  So I would like to
> emphasize that chisquare_nway is *not* an attempt to clone the
> R function chisq.test.  chisquare_nway does not do the 'one-way'
> goodness of fit test; use stats.chisquare for that.
>
> The file chisq4x3x2.r contains R code that *does* do a three-way
> test.  This code prints the following:
> {{{
> Call: xtabs(formula = count ~ r + c + t)
> Number of cases in table: 478
> Number of factors: 3
> Test for independence of all factors:
>         Chisq = 102.17, df = 17, p-value = 3.514e-14
> }}}
>
> The equivalent calculation using chisquare_nway:
> {{{
> >>> data = np.array(
>     [[[12, 34, 23],
>       [35, 31, 11],
>       [12, 32,  9],
>       [12, 12, 14]],
>      [[ 4, 47, 11],
>       [34, 10, 18],
>       [18, 13, 19],
>       [ 9, 33, 25]]])
> >>> chisquare_nway(data)
> (102.17314893322093,
>  3.5141225742891105e-14,
>  17,
>  array([[[ 18.48003361,  28.80711122,  17.66473801],
>         [ 19.60858528,  30.56632412,  18.74350064],
>         [ 14.53010276,  22.64986607,  13.88906882],
>         [ 14.81224068,  23.0896693 ,  14.15875948]],
>
>        [[ 18.79193291,  29.29330719,  17.96287705],
>         [ 19.93953187,  31.08221145,  19.05984664],
>         [ 14.77533657,  23.03214229,  14.12348348],
>         [ 15.06223631,  23.47936836,  14.39772588]]]))
> }}}
>
> Similarly, chisq2x2x2x2.r prints:
> {{{
> Call: xtabs(formula = data ~ r + c + d + t)
> Number of cases in table: 262
> Number of factors: 4
> Test for independence of all factors:
>         Chisq = 8.758, df = 11, p-value = 0.6442
> }}}
>
> This is the same data as the second example in the docstring.
> chisquare_nway matches R to the precision printed by R.

New description:

 The attached file, chisquare_nway.py, includes the function
 chisquare_nway() that computes a chi-square test of independence for an
 n-dimensional array.  I think this would be a nice enhancement for
 scipy.stats.  A discussion on this topic (and an early, limited version of
 the function) can be found here:

     http://mail.scipy.org/pipermail/scipy-dev/2010-June/014538.html

 Inspired by that discussion, I generalized the function, added the
 optional Yates' correction for continuity, and figured out how to do the
 equivalent calculation in R for comparison.

 For now, I have attached just a standalone Python file.  After getting
 some feedback (especially about the API), I'll create a patch containing
 the code, tests, and updates to the module docs and release notes.

 Some additional comments about the code:

 I have included the degrees of freedom and the table of expected
 frequencies in the output.  This is convenient for comparing to R, and
 they are just handy to have available.
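
 To illustrate what these two extra outputs represent, here is a minimal
 sketch of how the expected frequencies and degrees of freedom for a test
 of mutual independence could be computed with NumPy.  This is not the code
 in chisquare_nway.py: the helper name expected_and_dof is hypothetical, and
 the formulas are the textbook ones (product of the one-dimensional marginal
 sums divided by n**(d-1), and size - sum(shape) + d - 1).  The data is the
 4x3x2 example used in this ticket.

```python
from functools import reduce

import numpy as np


def expected_and_dof(observed):
    """Expected counts and degrees of freedom under mutual independence.

    Hypothetical helper for illustration only; not taken from
    chisquare_nway.py.
    """
    observed = np.asarray(observed, dtype=float)
    d = observed.ndim
    n = observed.sum()

    # One marginal total per axis, reshaped so the marginals broadcast
    # against each other into the full table shape.
    margins = []
    for axis in range(d):
        other_axes = tuple(k for k in range(d) if k != axis)
        shape = [1] * d
        shape[axis] = observed.shape[axis]
        margins.append(observed.sum(axis=other_axes).reshape(shape))

    # Under independence, E = (product of marginal totals) / n**(d - 1).
    expected = reduce(np.multiply, margins) / n ** (d - 1)

    # Number of cells minus the number of independently estimated parameters.
    dof = observed.size - sum(observed.shape) + d - 1
    return expected, dof


data = np.array(
    [[[12, 34, 23],
      [35, 31, 11],
      [12, 32,  9],
      [12, 12, 14]],
     [[ 4, 47, 11],
      [34, 10, 18],
      [18, 13, 19],
      [ 9, 33, 25]]])

expected, dof = expected_and_dof(data)

# Pearson chi-square statistic, without any continuity correction.
chi2 = ((data - expected) ** 2 / expected).sum()
```

 With this data, dof comes out as 17 and expected[0, 0, 0] is about 18.48,
 consistent with the chisquare_nway output quoted in this ticket.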

 I implemented the Yates correction for continuity, but it is only
 allowed when the degrees of freedom is 1.  Everything I have read
 seems to suggest that the correction is for this case only, but I
 have not dug very deeply. In particular, I haven't looked up the
 original reference.
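
 For readers unfamiliar with the correction, the following is a minimal
 sketch of the classic form of Yates' correction for a 2x2 table (the case
 with one degree of freedom): 0.5 is subtracted from each |observed -
 expected| before squaring.  The function name yates_chi2_2x2 and the counts
 are made up for illustration, and raising ValueError for other shapes is an
 assumption of this sketch, not a description of chisquare_nway.py.

```python
import numpy as np


def yates_chi2_2x2(observed):
    """Chi-square statistic with Yates' continuity correction (2x2 only).

    Hypothetical helper for illustration; uses the classic form
    sum((|O - E| - 0.5)**2 / E).
    """
    observed = np.asarray(observed, dtype=float)
    if observed.shape != (2, 2):
        # Assumption of this sketch: refuse anything but the 1-dof case.
        raise ValueError("this sketch only handles 2x2 tables (dof == 1)")

    n = observed.sum()
    row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
    col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
    expected = row_totals * col_totals / n

    corrected = (np.abs(observed - expected) - 0.5) ** 2 / expected
    return corrected.sum()


table = [[10, 20], [30, 40]]  # made-up counts, not from the ticket
stat = yates_chi2_2x2(table)
```

 For this table the expected counts are [[12, 18], [28, 42]] and every
 |observed - expected| is 2, so each numerator is 1.5**2 = 2.25 rather than
 the uncorrected 4.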

 As far as I can tell, R's chisq.test does not handle three-way
 or higher tests.  chisq.test does a one-way test of goodness of
 fit, or a two-way test of independence.  So I would like to
 emphasize that chisquare_nway is *not* an attempt to clone the
 R function chisq.test.  chisquare_nway does not do the 'one-way'
 goodness of fit test; use stats.chisquare for that.
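
 For contrast, the one-way goodness of fit test mentioned above is already
 available as scipy.stats.chisquare.  A short example, using made-up counts
 and the default of equal expected frequencies:

```python
from scipy import stats

# Made-up one-way counts; with no f_exp argument, chisquare() tests
# against equal expected frequencies in every category.
observed = [16, 18, 16, 14, 12, 12]
stat, p = stats.chisquare(observed)
```

 Here the 88 observations are compared against an expected count of 88/6 in
 each category, which is a different hypothesis from the independence of
 factors tested by chisquare_nway.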

 The file chisq4x3x2.r contains R code that *does* do a three-way
 test.  This code prints the following:
 {{{
 Call: xtabs(formula = data ~ r + c + t)
 Number of cases in table: 478
 Number of factors: 3
 Test for independence of all factors:
         Chisq = 102.17, df = 17, p-value = 3.514e-14
 }}}

 The equivalent calculation using chisquare_nway:
 {{{
 >>> data = np.array(
     [[[12, 34, 23],
       [35, 31, 11],
       [12, 32,  9],
       [12, 12, 14]],
      [[ 4, 47, 11],
       [34, 10, 18],
       [18, 13, 19],
       [ 9, 33, 25]]])
 >>> chisquare_nway(data)
 (102.17314893322093,
  3.5141225742891105e-14,
  17,
  array([[[ 18.48003361,  28.80711122,  17.66473801],
         [ 19.60858528,  30.56632412,  18.74350064],
         [ 14.53010276,  22.64986607,  13.88906882],
         [ 14.81224068,  23.0896693 ,  14.15875948]],

        [[ 18.79193291,  29.29330719,  17.96287705],
         [ 19.93953187,  31.08221145,  19.05984664],
         [ 14.77533657,  23.03214229,  14.12348348],
         [ 15.06223631,  23.47936836,  14.39772588]]]))
 }}}

 Similarly, chisq2x2x2x2.r prints:
 {{{
 Call: xtabs(formula = data ~ r + c + d + t)
 Number of cases in table: 262
 Number of factors: 4
 Test for independence of all factors:
         Chisq = 8.758, df = 11, p-value = 0.6442
 }}}

 This is the same data as the second example in the docstring.
 chisquare_nway matches R to the precision printed by R.

--

Comment(by warren.weckesser):

 Fixed R output--the original was output from an older version of the R
 code.

-- 
Ticket URL: <http://projects.scipy.org/scipy/ticket/1203#comment:1>
SciPy <http://www.scipy.org>
SciPy is open-source software for mathematics, science, and engineering.

