# [SciPy-Dev] chi-square test for a contingency (R x C) table

josef.pktd@gmai... josef.pktd@gmai...
Mon Jun 7 13:30:57 CDT 2010

```On Mon, Jun 7, 2010 at 12:45 PM, Bruce Southey <bsouthey@gmail.com> wrote:
> On 06/07/2010 10:45 AM, josef.pktd@gmail.com wrote:
>> On Mon, Jun 7, 2010 at 11:00 AM, Bruce Southey<bsouthey@gmail.com>  wrote:
>>
>>> On 06/07/2010 09:15 AM, josef.pktd@gmail.com wrote:
>>>
>>> On Fri, Jun 4, 2010 at 2:12 PM,<josef.pktd@gmail.com>  wrote:
>>>
>>>
>>> On Fri, Jun 4, 2010 at 1:08 PM, Bruce Southey<bsouthey@gmail.com>  wrote:
>>>
>>>
>>> On 06/03/2010 08:27 AM, Warren Weckesser wrote:
>>>
>>>
>>> Just letting you know that I'm not ignoring all the great comments from
>>> josef, Neil and Bruce about my suggestion for chisquare_contingency.
>>> Unfortunately, I won't have time to think about all the deeper
>>> suggestions for another week or so.   For now, I'll just say that I
>>> agree with josef's and Neil's suggestions for the docstring, and that
>>> Neil's summary of the function as simply a convenience function that
>>> calls stats.chisquare with appropriate arguments to perform a test of
>>> independence on a contingency table is exactly what I had in mind.
>>>
>>> Warren
>>>
>>>
>>>
>>>
>>>
>>> Hi,
>>> I looked at how SAS handles n-way tables. What it appears to do is break the
>>> original table down into a set of 2-way tables and does the analysis on each
>>> of these. So a 3 by 4 by 5 table is processed as three 2-way tables with the
>>> results of each 4 by 5 table presented. I do not know how Stata and R
>>> analysis analyze n-way tables.
>>>
>>> Consequently, I rewrote my suggested code (attached) to handle 3 and 4 way
>>> tables by using recursion. There should be some Python way to do that
>>> recursion for any number of dimensions. I also added the 1-way table (but
>>> that has a different hypothesis than the 2-way table) so users can send a
>>> 1-d table.
>>>
>>>
>>> (very briefly because I don't have much time today)
>>>
>>> I think, these are good extensions, but to handle all cases, the
>>> function is getting too large and would need several options.
>>>
>>> On your code and SAS, Z(correct me if my quick reading is wrong)
>>> You seem to be calculating conditional independence for the last two
>>> variables conditional on the values of the first variables. I think
>>> this could be generalized to all pairwise independence tests.
>>>
>>> Similar, I'm a bit surprised that SAS uses conditional and not
>>> marginal independence, I would have thought that the test for marginal
>>> independence (aggregate out all but 2 variables) would be the more
>>> common use case.
>>>
>>>
>>> You can argue SAS's formulation relates to how the table is constructed
>>> because the hypothesis associated with the table is dependent on how the
>>> user constructs it. For example, the 3-way table A by (B by C) is very
>>> different from the 3-way table C by (B by A) yet these involve the same
>>> underlying numbers. If a user did not specify an order then considering all
>>> possible hypotheses is an option.
>>>
>> I don't know the SAS notation, what I thought in analogy to regression
>> analysis, is that if one variable is considered as endogenous, then
>> only pairwise tests with this variable need to be included.
>>
> This is not the same as regression for multiple reasons. Here we are
> testing independence without any distribution assumption associated with
> the actual data. (Of course under the normality assumption then these
> are the same. )
>
>>
>>> Really log-linear models are a better approach to analysis n-way tables
>>> because these allow you to examine all these different hypotheses.
>>>
>>> just some more questions and comments (until I have time to check this)
>>>
>>> looking at conditional independence looks similar to linear regression
>>> models, where the effect of other variables is taken out. However,
>>> looking at all chisquare tests (conditional on all possible other
>>> values) runs into the multiple test problem. Is the some kind of
>>> post-hoc or Bonferroni correction or is there a distribution for eg.
>>> the max of all chisquare test statistics.
>>>
>>>
>>> Ignoring my views on this, first 'multiple test problems' do not change the
>>> probability calculation for most approaches to compute the 'raw' p-value as
>>> the vast majority of the approaches require the 'raw' p-value.
>>>
>>> Second, it is very easy to say 'correct for multiple tests' but that is pure
>>> ignorance when 'what' you are correcting is for is not stated. If you are
>>> correcting the 'family-wise error rate' then you need to correctly define
>>> 'family-wise' in this situation especially to address at least one other
>>>
> In a 2-way table there is no need for any correction so it is pointless
> to say 'correct for multiple tests'. In a 3-way or higher table, as you
> indicated, is essentially a test of conditional independence as I
> implemented it. It is also pointless to say 'correct for multiple tests'
> because you are first assuming conditional independence between say A by
> B given C=1 and A by B for C=2. So what happens when C=1 is independent
> of when C=2 so these do belong to different 'families'. Second, there is
> nothing said about the relation of either A  or B with C - which may be
> a more critical problem.
>
>> We
>> recently had the discussion about multiple tests in the context of
>>
> I am perhaps too aware of multiple testing and unfortunately these types
> of discussions go on and on and on. A lot depends on which of many
> 'schools' of thought you subscribe to. It basically amounts to 'hand
> waving'  with no solution because these schools are defined by different
> fundamental  assumptions that can not be challenged. Ultimately none are
> correct because we never know the true situation - if we did we would
> not be doing it.

I think it depends on the hypothesis and the general statistical
theory is relatively clear, but maybe some people prefer a
"test-mining" approach.

>> In econometrics, there is an extensive literature on this, and some
>> cases like structural change tests with unknown change points I know
>> pretty well.
>>
>> The main point that I wanted to make is, that multiple change tests
>> need more attention and at least a warning in the docstring which
>> (raw) p-values are reported, since it is easy for unwary users to
>> misinterpret the reported p-values. But hopefully this could be
>> extended to provide the user with options to do an appropriate
>> correction.
>>
>> Josef
>>
> This is pointless because you are misunderstanding what is meant by
> 'multiple test correction'.

???

> Placing those kinds of statements in the
> wrong places also reflects ignorance especially when the correct value
> maybe given and there is no 'appropriate' correction possible. Further
> no statement is ever going to protect users from misinterpreting p-values.

Doing a quick search on the recent literature, it seems there is a lot
going on in doing proper multiple test correction, additional to more
traditional tests, that I haven't tried you to really understand or
where I don't know how well they generalize, e.g. (generalized)
Cochran-Mantel-Haenszel Chi-Squared Test, Cochran’s Q test.

I only read the abstract of this:
http://jnci.oxfordjournals.org/cgi/content/abstract/99/2/147

"Twenty-one (50%) of them contained at least one of the following
three basic flaws: 1) in outcome-related gene finding, an unstated,
unclear, or inadequate control for multiple testing; 2) ....."

Josef

>
> Bruce
>
>
>>
>>
>>> with an iterator (numpy mailinglist), my version for the conditional
>>> independence of the last two variables for all values of the earlier
>>> variables looks like
>>>
>>> for ind in allbut2ax_iterator(table3, axes=(-2,-1)):
>>>      print chisquare_contingency(table3[ind])
>>>
>>> Josef
>>>
>>>
>>>
>>> http://article.gmane.org/gmane.comp.python.numeric.general/38352
>>>
>>> I would have to see.
>>>
>>> Bruce
>>>
>>> Initially, I was thinking just about independence of all variables in
>>> a 3 or more way table, i.e. P(x,y,z)=P(x)*P(y)*P(z)
>>>
>>> My opinion is that these variations of tests would fit better in a
>>> class where all pairwise conditional, and marginal and joint
>>> hypotheses can be supplied as methods, or split it up into a group of
>>> functions.
>>>
>>> Thanks,
>>>
>>> Josef
>>>
>>>
>>>
>>> The data used is from two SAS examples and I added a dimension to get a
>>> 4-way table. I included the SAS values but these are only to 4 decimal
>>> places for reference.
>>>
>>> http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#/documentation/cdl/en/procstat/63104/HTML/default/procstat_freq_sect029.htm
>>> http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#/documentation/cdl/en/procstat/63104/HTML/default/procstat_freq_sect030.htm
>>>
>>> What is missing:
>>> 1) Docstring and tests but those are dependent what is ultimately decided
>>> 2) Other test statistics but scipy.stats versions are not very friendly in
>>> that these do not accept a 2-d array
>>> 3) A way to do recursion
>>> 4) Ability to label the levels etc.
>>> 5) Correct handling of input types.
>>>
>>> Bruce
>>>
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>
>>>
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>
>>>
>>>
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
```