[SciPy-Dev] chi-square test for a contingency (R x C) table
Mon Jun 7 10:00:35 CDT 2010
On 06/07/2010 09:15 AM, email@example.com wrote:
> On Fri, Jun 4, 2010 at 2:12 PM, <firstname.lastname@example.org> wrote:
>> On Fri, Jun 4, 2010 at 1:08 PM, Bruce Southey <email@example.com> wrote:
>>> On 06/03/2010 08:27 AM, Warren Weckesser wrote:
>>>> Just letting you know that I'm not ignoring all the great comments from
>>>> josef, Neil and Bruce about my suggestion for chisquare_contingency.
>>>> Unfortunately, I won't have time to think about all the deeper
>>>> suggestions for another week or so. For now, I'll just say that I
>>>> agree with josef's and Neil's suggestions for the docstring, and that
>>>> Neil's summary of the function as simply a convenience function that
>>>> calls stats.chisquare with appropriate arguments to perform a test of
>>>> independence on a contingency table is exactly what I had in mind.
>>> I looked at how SAS handles n-way tables. What it appears to do is break the
>>> original table down into a set of 2-way tables and do the analysis on each
>>> of these. So a 3 by 4 by 5 table is processed as three 2-way tables with the
>>> results of each 4 by 5 table presented. I do not know how Stata and R
>>> analyze n-way tables.
>>> Consequently, I rewrote my suggested code (attached) to handle 3 and 4 way
>>> tables by using recursion. There should be some Python way to do that
>>> recursion for any number of dimensions. I also added the 1-way table (but
>>> that has a different hypothesis than the 2-way table) so users can send a
>>> 1-d table.
>> (very briefly because I don't have much time today)
>> I think these are good extensions, but to handle all cases, the
>> function is getting too large and would need several options.
>> On your code and SAS (correct me if my quick reading is wrong):
>> You seem to be calculating conditional independence for the last two
>> variables conditional on the values of the first variables. I think
>> this could be generalized to all pairwise independence tests.
>> Similarly, I'm a bit surprised that SAS uses conditional and not
>> marginal independence, I would have thought that the test for marginal
>> independence (aggregate out all but 2 variables) would be the more
>> common use case.
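
To make the conditional versus marginal distinction concrete, here is a minimal sketch. The counts are made up for illustration, `chi2_stat` is a hypothetical helper (not the proposed `chisquare_contingency`), and only the Pearson statistic and degrees of freedom are computed, with NumPy alone.

```python
import numpy as np

def chi2_stat(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-way table."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    # Expected counts under independence of rows and columns.
    expected = row * col / table.sum()
    stat = ((table - expected) ** 2 / expected).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    return stat, dof

# Made-up 2 x 2 x 2 table of counts with axes (A, B, C).
table3 = np.array([[[10, 20], [30, 60]],
                   [[40, 10], [20, 5]]])

# Conditional independence of B and C: one test per level of A.
conditional = [chi2_stat(table3[a]) for a in range(table3.shape[0])]

# Marginal independence of B and C: sum A out, then run a single test.
marginal = chi2_stat(table3.sum(axis=0))
```

In this made-up table B and C are exactly independent within each level of A, yet the table aggregated over A is not, so the two approaches answer different questions.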
You can argue SAS's formulation relates to how the table is constructed
because the hypothesis associated with the table is dependent on how the
user constructs it. For example, the 3-way table A by (B by C) is very
different from the 3-way table C by (B by A) yet these involve the same
underlying numbers. If a user did not specify an order then considering
all possible hypotheses is an option.
Really log-linear models are a better approach to analyzing n-way tables
because these allow you to examine all these different hypotheses.
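
As one illustration, the mutual independence hypothesis P(x,y,z)=P(x)*P(y)*P(z) raised elsewhere in this thread corresponds to the log-linear model with main effects only. The following is a rough sketch of its Pearson statistic for an n-way table. The function name and the example counts are hypothetical, and it assumes every marginal total is positive.

```python
import numpy as np

def mutual_independence_stat(table):
    """Pearson statistic for H0: all variables of an n-way table are
    mutually independent (assumes all marginal totals are positive)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.full(table.shape, n)
    for axis in range(table.ndim):
        other = tuple(k for k in range(table.ndim) if k != axis)
        margin = table.sum(axis=other) / n      # marginal proportions
        shape = [1] * table.ndim
        shape[axis] = -1
        # Broadcast this variable's proportions along its own axis.
        expected = expected * margin.reshape(shape)
    stat = ((table - expected) ** 2 / expected).sum()
    # Cells minus one, minus the free parameters of each margin.
    dof = table.size - 1 - sum(s - 1 for s in table.shape)
    return stat, dof, expected

# Made-up counts built as an outer product, so H0 holds exactly.
a = np.array([1, 2])
b = np.array([1, 3])
c = np.array([2, 1])
counts = a[:, None, None] * b[None, :, None] * c[None, None, :]
stat, dof, expected = mutual_independence_stat(counts)
```

For a 2-way table the degrees of freedom reduce to the familiar (R-1)*(C-1).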
> just some more questions and comments (until I have time to check this)
> looking at conditional independence looks similar to linear regression
> models, where the effect of other variables is taken out. However,
> looking at all chisquare tests (conditional on all possible other
> values) runs into the multiple test problem. Is there some kind of
> post-hoc or Bonferroni correction, or is there a distribution for, e.g.,
> the max of all chisquare test statistics?
Ignoring my views on this, first, 'multiple test problems' do not change
how the 'raw' p-value of each test is computed, because the vast majority
of the correction approaches take the 'raw' p-values as their input.
Second, it is very easy to say 'correct for multiple tests' but that is
pure ignorance when 'what' you are correcting for is not stated. If
you are correcting the 'family-wise error rate' then you need to
correctly define 'family-wise' in this situation especially to address
at least one other assumption being made.
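
A small sketch of both points, using the Bonferroni adjustment mentioned above. The p-values and family sizes are made up, and `bonferroni` is a hypothetical helper: the raw p-values are only an input, and the adjusted values depend entirely on how the family is defined.

```python
def bonferroni(raw_pvalues, family_size=None):
    """Bonferroni-adjusted p-values; the raw p-values are left unchanged.

    family_size is the number of tests in the family being controlled.
    It defaults to the number of p-values supplied, which is itself an
    assumption about what the family is.
    """
    m = len(raw_pvalues) if family_size is None else family_size
    return [min(1.0, p * m) for p in raw_pvalues]

# Made-up raw p-values from three conditional 2-way tests.
raw = [0.01, 0.04, 0.30]

# Family = just these three tests.
adjusted_3 = bonferroni(raw)

# Family = a larger set of 12 tests, e.g. if other pairings were also run.
adjusted_12 = bonferroni(raw, family_size=12)
```

The same raw p-value of 0.04 stays below 0.05 in neither case here, but how far it moves depends only on the stated family size.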
> with an iterator (numpy mailinglist), my version for the conditional
> independence of the last two variables for all values of the earlier
> variables looks like
>     for ind in allbut2ax_iterator(table3, axes=(-2,-1)):
>         print(chisquare_contingency(table3[ind]))
I would have to see.
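
For reference, one way such an iterator might look. This is only a guess at the behaviour of `allbut2ax_iterator`, whose actual code is not shown in this thread; `iter_2way_slices` is a hypothetical name, and the sketch relies only on `np.moveaxis` and `np.ndindex`. It also avoids explicit recursion for any number of dimensions.

```python
import numpy as np

def iter_2way_slices(table, axes=(-2, -1)):
    """Yield (index, 2-way slice) for every combination of the other axes."""
    table = np.asarray(table)
    # Move the two axes that form each 2-way table to the end.
    moved = np.moveaxis(table, axes, (-2, -1))
    # Loop over every index combination of the remaining leading axes.
    for ind in np.ndindex(moved.shape[:-2]):
        yield ind, moved[ind]

# A 3 by 4 by 5 table yields three 4 by 5 tables.
table3 = np.arange(60).reshape(3, 4, 5)
slices3 = list(iter_2way_slices(table3))

# A 2 by 3 by 4 by 5 table yields six 4 by 5 tables.
table4 = np.arange(120).reshape(2, 3, 4, 5)
slices4 = list(iter_2way_slices(table4))
```

Each yielded slice could then be passed to whatever 2-way test is chosen.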
>> Initially, I was thinking just about independence of all variables in
>> a 3 or more way table, i.e. P(x,y,z)=P(x)*P(y)*P(z)
>> My opinion is that these variations of tests would fit better in a
>> class where all pairwise conditional, and marginal and joint
>> hypotheses can be supplied as methods, or split it up into a group of
>>> The data used is from two SAS examples and I added a dimension to get a
>>> 4-way table. I included the SAS values but these are only to 4 decimal
>>> places for reference.
>>> What is missing:
>>> 1) Docstring and tests but those are dependent on what is ultimately decided
>>> 2) Other test statistics but scipy.stats versions are not very friendly in
>>> that these do not accept a 2-d array
>>> 3) A way to do recursion
>>> 4) Ability to label the levels etc.
>>> 5) Correct handling of input types.
>>> SciPy-Dev mailing list