[SciPy-Dev] chi-square test for a contingency (R x C) table
Neil Martinsen-Burrell
nmb@wartburg....
Wed Jun 2 07:24:25 CDT 2010
On 2010-06-01 23:28 , Warren Weckesser wrote:
> I've been digging into some basic statistics recently, and developed the
> following function for applying the chi-square test to a contingency
> table. Does something like this already exist in scipy.stats? If not,
> any objections to adding it? (Tests are already written :)
Something like this would be great in scipy.stats since I end up doing
the exact same thing by hand whenever I grade introductory statistics
exams. Thanks for writing this!
I've got some code review comments that I'll include below.
> def chisquare_contingency(table):
I think that chisquare_twoway fits the common name for this test better,
but as Joseph mentions, this neglects the possibility of expanding this
to n dimensions.
>     """Chi-square calculation for a contingency (R x C) table.
The docstring should emphasize that this is a hypothesis test. See for
example http://docs.scipy.org/scipy/docs/scipy.stats.stats.ttest_rel/.
I'm not familiar with the R x C notation, but it does make clear which
chi-square test this is.
>
>     This function computes the chi-square statistic and p-value of the
>     data in the table. The expected frequencies are computed based on
>     the relative frequencies in the table.
I try to explain what the null and alternative hypotheses are for the
tests in scipy.stats, and this docstring should state them too.
>
>     Parameters
>     ----------
>     table : array_like, 2D
>         The contingency table, also known as the R x C table.
This could also say something like "The table contains the observed
frequencies of each category."
>
>     Returns
>     -------
>     chisquare statistic : float
>         The chisquare test statistic
>     p : float
>         The p-value of the test.
A function like this could really use an example, perhaps straight from
one of the tests.
>     """
>     table = np.asarray(table)
>     if table.ndim != 2:
>         raise ValueError("table must be a 2D array.")
>
>     # Create the table of expected frequencies.
>     total = table.sum()
>     row_sum = table.sum(axis=1).reshape(-1,1)
>     col_sum = table.sum(axis=0)
>     expected = row_sum * col_sum / float(total)
I think that np.outer(row_sum, col_sum) is clearer than reshaping one to
be a column vector.
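To illustrate, here is a minimal sketch comparing the two ways of building
the expected table. The 2 x 3 table of counts is made up purely for
illustration, and the variable names expected_broadcast and expected_outer
are my own, not part of the proposed function.

```python
import numpy as np

# Hypothetical 2 x 3 table of observed counts (illustration only).
table = np.array([[10, 20, 30],
                  [20, 20, 20]])

total = table.sum()
row_sum = table.sum(axis=1)   # one total per row, shape (2,)
col_sum = table.sum(axis=0)   # one total per column, shape (3,)

# Broadcasting version, as in the quoted code: reshape the row totals
# into a column vector so that the product has shape (2, 3).
expected_broadcast = row_sum.reshape(-1, 1) * col_sum / float(total)

# Outer-product version: the 1D totals are used directly, no reshape.
expected_outer = np.outer(row_sum, col_sum) / float(total)

# The two tables should agree entry by entry.
print(np.allclose(expected_broadcast, expected_outer))
```

Both give the same expected frequencies; the np.outer form just says more
directly that each entry is (row total) * (column total) / (grand total).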
>
>     # Since we are passing in 1D arrays of length table.size, the default
>     # number of degrees of freedom is table.size-1.
>     # For a contingency table, the actual number of degrees of freedom is
>     # (nr - 1)*(nc-1). We use the ddof argument
>     # of the chisquare function to adjust the default.
>     nr, nc = table.shape
>     dof = (nr - 1) * (nc - 1)
>     dof_adjust = (table.size - 1) - dof
>
>     chi2, p = chisquare(np.ravel(table), np.ravel(expected),
>                         ddof=dof_adjust)
>     return chi2, p
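As a rough sketch of the kind of example the docstring could carry, here is
the function with the np.outer suggestion folded in, followed by a call on a
made-up 2 x 3 table. The data are hypothetical and this is only one way the
pieces might fit together, not a final implementation.

```python
import numpy as np
from scipy.stats import chisquare


def chisquare_contingency(table):
    """Chi-square test of independence for a 2D contingency table (sketch)."""
    table = np.asarray(table)
    if table.ndim != 2:
        raise ValueError("table must be a 2D array.")

    # Expected frequencies under independence of rows and columns.
    row_sum = table.sum(axis=1)
    col_sum = table.sum(axis=0)
    expected = np.outer(row_sum, col_sum) / float(table.sum())

    # chisquare() assumes table.size - 1 degrees of freedom by default;
    # ddof shifts that down to (nr - 1) * (nc - 1).
    nr, nc = table.shape
    dof = (nr - 1) * (nc - 1)
    dof_adjust = (table.size - 1) - dof

    chi2, p = chisquare(np.ravel(table), np.ravel(expected),
                        ddof=dof_adjust)
    return chi2, p


# Hypothetical observed counts (illustration only).
observed = [[10, 20, 30],
            [20, 20, 20]]
chi2, p = chisquare_contingency(observed)
print(chi2, p)
```

For this table the expected frequencies are 15, 20, 25 in each row, so the
statistic works out to 16/3 (about 5.33) on 2 degrees of freedom, giving a
p-value of roughly 0.069.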