[SciPy-Dev] chi-square test for a contingency (R x C) table

josef.pktd@gmai... josef.pktd@gmai...
Thu Jun 3 01:09:56 CDT 2010


On Thu, Jun 3, 2010 at 12:47 AM, Neil Martinsen-Burrell
<nmb@wartburg.edu> wrote:
> On 2010-06-02 15:03 , Bruce Southey wrote:
>> On 06/02/2010 01:41 PM, josef.pktd@gmail.com wrote:
>>> On Wed, Jun 2, 2010 at 2:18 PM, Neil Martinsen-Burrell<nmb@wartburg.edu>  wrote:
>>>
>>>> On 2010-06-02 13:10 , Bruce Southey wrote:
>
> [...]
>
>>> I agree with Neil that this is a very useful convenience function.
>>>
>> My problem with the chisquare_twoway is that it should not call another
>> function to finish two lines of code. It is just an excessive waste of
>> resources.
>
> Do you mean that you would rather see the equivalent of
>
> chisq = (table - expected)**2 / expected
> return chisq, chisqprob(chisq, dof)
>
> at the bottom of chisquare_contingency than the current call to
> chisquare?  I'm certainly okay with that.

But don't forget to ravel, or you get a cell-wise chisquare :)
For non-performance-sensitive parts, as in this case, I usually go by
how easy the function is to understand and to test.
For example, I prefer distributions.chi2.sf(chisq, dof) to
chisqprob(chisq, dof) (I haven't checked whether it is correct)
because I immediately see that it is a one-sided p-value.

Inlining in this case might be nicer because of dof (when inlining)
versus ddof (when calling chisquare); I found the ddof confusing to
read.
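
A minimal sketch of what the inlined version could look like, assuming
a 2-d table of counts. The function name chisquare_contingency_inline
is made up for illustration; only numpy and scipy.stats.chi2.sf are
real library calls here.

```python
import numpy as np
from scipy import stats


def chisquare_contingency_inline(table):
    """Pearson chi-square test of independence for a 2-d table of counts.

    Hypothetical helper, written only to illustrate the inlined form.
    """
    table = np.asarray(table, dtype=float)
    row_sums = table.sum(axis=1, keepdims=True)
    col_sums = table.sum(axis=0, keepdims=True)
    expected = row_sums * col_sums / table.sum()

    # Sum over all cells. Without the sum (or a ravel before calling
    # chisquare) the result would be cell-wise.
    chisq = ((table - expected) ** 2 / expected).sum()

    # dof is explicit here, so no ddof bookkeeping is needed.
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)

    # Upper-tail (one-sided) p-value of the chi-square distribution.
    pvalue = stats.chi2.sf(chisq, dof)
    return chisq, pvalue, dof


chisq, pvalue, dof = chisquare_contingency_inline([[10, 20], [20, 10]])
```

With dof computed directly from the table shape, the reader does not
have to translate between dof and ddof.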

related: while I was skimming Bruce's reference
http://faculty.vassar.edu/lowry/ch8pt2.html
I saw that they recommend continuity correction for the 2by2 case.
Do you know what the common position on continuity correction is in this case?

(In something vaguely related to this, I read recently that some
continuity corrections make the test too conservative and are not
recommended. But I don't remember for which test I read this.)

If there is test specific continuity correction, then chisquare will
have to be inlined.
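
To make the point concrete, here is a rough sketch of an inlined 2by2
test, assuming the correction meant is Yates' correction (shrink each
|observed - expected| by 0.5 before squaring). The function name and
the choice to clamp the shrunken difference at zero are assumptions
for illustration, not a statement about what scipy should do.

```python
import numpy as np
from scipy import stats


def chisquare_2by2_yates(table):
    """Chi-square test for a 2x2 table with Yates' continuity correction.

    Hypothetical helper; the clamping at zero is an assumption.
    """
    table = np.asarray(table, dtype=float)
    if table.shape != (2, 2):
        raise ValueError("this sketch only handles 2x2 tables")

    row_sums = table.sum(axis=1, keepdims=True)
    col_sums = table.sum(axis=0, keepdims=True)
    expected = row_sums * col_sums / table.sum()

    # Yates: reduce each absolute deviation by 0.5, but not below zero.
    diff = np.maximum(np.abs(table - expected) - 0.5, 0.0)
    chisq = (diff ** 2 / expected).sum()

    # A 2x2 table has one degree of freedom.
    return chisq, stats.chi2.sf(chisq, 1)


chisq_yates, pvalue_yates = chisquare_2by2_yates([[10, 20], [20, 10]])
```

The correction changes the deviations before they are squared, so it
cannot be bolted on after a call to chisquare.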

>
>>> I never heard of a one-way contingency table, my question was whether
>>> the function should also handle 3-way or 4-way tables, additional to
>>> two-way.
>>>
>> Correct to both of these as I just consider these as n-way tables. I
>> think that contingency tables by definition only apply to the 2-d
>> case. Pivot tables are essentially the same thing. I would have to
>> look up how to get the expected number of outcomes but probably of the
>> form Ni.. * N.j. * N..k / N...^2 for the 3-way (the 2-way table is of the
>> form Ni. * N.j / N..) for i=rows, j=columns, k=3rd axis and '.' means sum
>> for that axis.
>
> That is the correct (tensor) formula for higher dimensional tables.
> Pragmatically, since the number of cells climbs so rapidly with
> increasing dimension, there are more problems with small expected
> counts.  If we thought people would be interested in using it, we could
> certainly define a chisquare_nway function as well.

I'm not too happy about having a large number of small functions
especially if they have code duplication and need to be separately
maintained.
When there is a demand for a convenient special case, then it could
just call the more general function.
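
As a sketch of such a general function: expected counts under mutual
independence for a table of any dimension, with the two-way case
falling out as a special case. The name expected_nway is made up. For
d dimensions the product of the d one-way margins is divided by
N**(d-1), so that the expected counts again sum to N.

```python
from functools import reduce

import numpy as np


def expected_nway(table):
    """Expected counts under mutual independence for an n-way table.

    Hypothetical helper, for illustration only.
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    d = table.ndim

    margins = []
    for axis in range(d):
        # One-way margin for this axis: sum over all other axes, keeping
        # singleton dimensions so the margins broadcast against each other.
        other = tuple(k for k in range(d) if k != axis)
        margins.append(table.sum(axis=other, keepdims=True))

    # Broadcasting expands the product to the full table shape.
    return reduce(np.multiply, margins) / n ** (d - 1)


expected_2way = expected_nway([[10, 20], [20, 10]])
expected_3way = expected_nway(np.ones((2, 3, 4)))
```

A chisquare_twoway could then be a thin wrapper around this, rather
than a second copy of the same arithmetic.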

For testing distributions, the common approach when there are too few
expected counts in some cells is to combine several cells together
into one bin.
I guess something like this might also be feasible for nway,
i.e. coarsen the grid, or not?
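
Mechanically, coarsening could look like the sketch below, which merges
adjacent levels along one axis with np.add.reduceat. The helper name
coarsen and its starts argument are assumptions for illustration, and
whether merging levels is statistically sensible depends on the data.

```python
import numpy as np


def coarsen(table, starts, axis=0):
    """Merge adjacent levels along `axis`.

    `starts` holds the first index of each merged group.
    Hypothetical helper, for illustration only.
    """
    return np.add.reduceat(np.asarray(table), starts, axis=axis)


table = np.array([[1, 2],
                  [3, 4],
                  [10, 20],
                  [30, 40]])

# Merge rows 0-1 into one level and rows 2-3 into another.
merged = coarsen(table, [0, 2], axis=0)
```

The same call works for any axis of an n-way table, since reduceat
takes an axis argument.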

>
>>> I thought about the question how the input should be specified for my
>>> initial response, the alternative would be to use the original data or
>>> a "long" format instead of a table. But I thought that as a
>>> convenience function using the table format will be the most common
>>> use.
>>> I have written in the past functions that calculate the contingency
>>> table, and it would be very useful to have a more complete coverage of
>>> tools to work with contingency tables in scipy.stats (or temporarily
>>> in statsmodels, where we are working also on the anova type of
>>> analysis)
>>>
>> It depends on what tasks are needed. Really there are two steps:
>> 1) Cross-tabulation that summarizes the data from whatever input
>> (groupby would help here).
>> 2) Statistical tests - series of functions that accept summarized data only.
>>
>> If you have separate functions then the burden is on the user to find
>> and call all the desired functions. You can also provide a single helper
>> function to do all that because you don't want to repeat unnecessary calls.
>
> The facilities for handling raw, frame-style data in scipy.stats are not
> too strong.  A tabulation function that we could stick together with the
> chisquare* functions to make a single helper would certainly be convenient.

Broader coverage of contingency tables, with all the data handling,
bincount and table conversions, would be a much larger set of
functions.
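
Just the tabulation step for two raw ("long" format) variables could be
sketched with np.unique and np.bincount as below. The name
crosstab_counts is made up, and missing values are not handled.

```python
import numpy as np


def crosstab_counts(x, y):
    """Count co-occurrences of the levels of two raw 1-d variables.

    Hypothetical helper; does not handle missing values.
    """
    # Map each observation to an integer code for its level.
    x_levels, x_codes = np.unique(np.asarray(x).ravel(), return_inverse=True)
    y_levels, y_codes = np.unique(np.asarray(y).ravel(), return_inverse=True)
    shape = (len(x_levels), len(y_levels))

    # Combine the two codes into one flat cell index, then count.
    flat = np.ravel_multi_index((x_codes, y_codes), shape)
    counts = np.bincount(flat, minlength=shape[0] * shape[1])
    return x_levels, y_levels, counts.reshape(shape)


x = ["a", "a", "b", "b", "b"]
y = ["u", "v", "u", "u", "v"]
x_levels, y_levels, table = crosstab_counts(x, y)
```

The minlength argument keeps empty cells in the table, which matters
when some combination of levels never occurs.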

I think our still evolving design for statistics (including tests) in
statsmodels is to move to a more object oriented design, to keep
things together, and to take advantage of reusing previous
calculations.

In this case it could be a ContingencyTable class that could combine
creating the countdata from raw data (with or without missing values),
marginalization if it's 3-way or higher, attach several tests, create
a nice string that can be printed, and so on. With lazy evaluation and
reuse of previous calculations, we think this would be a better design
than only having standalone functions.
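
A very rough sketch of the idea, not an existing scipy or statsmodels
class: the class name, its methods and the use of
functools.cached_property for the lazy evaluation are all assumptions
for illustration. The test itself only covers the 2-d case here.

```python
from functools import cached_property

import numpy as np
from scipy import stats


class ContingencyTable:
    """Hypothetical sketch of a table object with lazily computed results."""

    def __init__(self, counts):
        self.counts = np.asarray(counts, dtype=float)

    def marginalize(self, keep):
        """Sum out every axis not listed in `keep`; returns a new table."""
        drop = tuple(k for k in range(self.counts.ndim) if k not in keep)
        return ContingencyTable(self.counts.sum(axis=drop))

    @cached_property
    def expected(self):
        # Computed on first access, then reused by every test that needs it.
        rows = self.counts.sum(axis=1, keepdims=True)
        cols = self.counts.sum(axis=0, keepdims=True)
        return rows * cols / self.counts.sum()

    @cached_property
    def chisquare(self):
        chisq = ((self.counts - self.expected) ** 2 / self.expected).sum()
        dof = (self.counts.shape[0] - 1) * (self.counts.shape[1] - 1)
        return chisq, stats.chi2.sf(chisq, dof)

    def __str__(self):
        chisq, pvalue = self.chisquare
        return "chi2 = %.4f, p-value = %.4f" % (chisq, pvalue)


three_way = ContingencyTable(np.arange(1, 9).reshape(2, 2, 2))
two_way = three_way.marginalize(keep=(0, 1))
print(two_way)
```

Further tests would be added as more cached properties that share
expected, instead of each standalone function recomputing it.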

Grouping functions together:
While statisticians might have a good overview of all the different
tests, I found the "laundry list" of functions in scipy.stats pretty
confusing for a long time.
Instead of having a group of functions fisherexact, chisquare_twoway,
chisquare_nway, and several other possible candidates for independence
tests in contingency tables, we are starting to combine them,
e.g. independence_tests, mean_tests, variance_tests and
correlation_test.

We were discussing this in statsmodels in a different context, mainly
diagnostic tests for regression, e.g. heteroscedasticity,
autocorrelation tests or more recently post-hoc tests.

In the current case, I also thought that combining with a fisherexact
or other tests would potentially be useful, with a keyword argument
that selects "chisquare", "exact", "...".
This is not yet relevant in this case because fisherexact, even when
it works, is only for 2by2, and I don't think mixing them together is
very useful.

Josef



> -Neil
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>

