[SciPy-Dev] chi-square test for a contingency (R x C) table
josef.pktd@gmai...
josef.pktd@gmai...
Wed Jun 2 09:37:56 CDT 2010
On Wed, Jun 2, 2010 at 8:24 AM, Neil Martinsen-Burrell <nmb@wartburg.edu> wrote:
> On 2010-06-01 23:28 , Warren Weckesser wrote:
>> I've been digging into some basic statistics recently, and developed the
>> following function for applying the chi-square test to a contingency
>> table. Does something like this already exist in scipy.stats? If not,
>> any objects to adding it? (Tests are already written :)
>
> Something like this would be great in scipy.stats since I end up doing
> the exact same thing by hand whenever I grade introductory statistics
> exams. Thanks for writing this!
>
> I've got some code review comments that I'll include below.
>
>> def chisquare_contingency(table):
>
> I think that chiquare_twoway fits the common name for this test better,
> but as Joseph mentions, this neglects the possibility of expanding this
> to n-dimensions.
>
>> """Chi-square calculation for a contingency (R x C) table.
>
> The docstring should emphasize that this is a hypothesis test. See for
> example http://docs.scipy.org/scipy/docs/scipy.stats.stats.ttest_rel/.
> I'm not familiar with the R x C notation, but it does work to make clear
> which chi square test this is.
>
>>
>> This function computes the chi-square statistic and p-value of the
>> data in the table. The expected frequencies are computed based on
>> the relative frequencies in the table.
>
> I try to explain what the null and alternative hypotheses are for the
> tests in scipy.stats.
>
>>
>> Parameters
>> ----------
>> table : array_like, 2D
>> The contingency table, also known as the R x C table.
>
> This could also say something like "The table contains the observed
> frequencies of each category."
>
>>
>> Returns
>> -------
>> chisquare statistic : float
>> The chisquare test statistic
>> p : float
>> The p-value of the test.
>
> A function like this could really use an example, perhaps straight from
> one of the tests.
>
>> """
>> table = np.asarray(table)
>> if table.ndim != 2:
>> raise ValueError("table must be a 2D array.")
>>
>> # Create the table of expected frequencies.
>> total = table.sum()
>> row_sum = table.sum(axis=1).reshape(-1,1)
>> col_sum = table.sum(axis=0)
>> expected = row_sum * col_sum / float(total)
>
> I think that np.outer(row_sum, col_sum) is clearer than reshaping one to
> be a column vector.
>
>>
>> # Since we are passing in 1D arrays of length table.size, the default
>> # number of degrees of freedom is table.size-1.
>> # For a contingency table, the actual number degrees of freedom is
>> # (nr - 1)*(nc-1). We use the ddof argument
>> # of the chisquare function to adjust the default.
>> nr, nc = table.shape
>> dof = (nr - 1) * (nc - 1)
>> dof_adjust = (table.size - 1) - dof
>>
>> chi2, p = chisquare(np.ravel(table), np.ravel(expected),
>> ddof=dof_adjust)
>> return chi2, p
Just a thought:
I think it would be useful to have this kind of proposals on the
scipy-user list (even though it is a dev issue), just to be able to
get more feedback from potential users.
And again,
Thanks Neil, it's very nice to have the statistics in the docstrings
instead of having to run to Wikipedia
Josef
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
More information about the SciPy-Dev
mailing list