[SciPy-Dev] chi-square test for a contingency (R x C) table

Warren Weckesser warren.weckesser@enthought....
Wed Jun 16 23:58:38 CDT 2010


The feedback in this thread inspired me to generalize my original code 
to the n-way test of independence.  I have attached the revised code to 
a new ticket:

    http://projects.scipy.org/scipy/ticket/1203

More feedback would be great!

Warren


josef.pktd@gmail.com wrote:
> On Mon, Jun 7, 2010 at 12:45 PM, Bruce Southey <bsouthey@gmail.com> wrote:
>   
>> On 06/07/2010 10:45 AM, josef.pktd@gmail.com wrote:
>>     
>>> On Mon, Jun 7, 2010 at 11:00 AM, Bruce Southey<bsouthey@gmail.com>  wrote:
>>>
>>>       
>>>> On 06/07/2010 09:15 AM, josef.pktd@gmail.com wrote:
>>>>
>>>> On Fri, Jun 4, 2010 at 2:12 PM,<josef.pktd@gmail.com>  wrote:
>>>>
>>>>
>>>> On Fri, Jun 4, 2010 at 1:08 PM, Bruce Southey<bsouthey@gmail.com>  wrote:
>>>>
>>>>
>>>> On 06/03/2010 08:27 AM, Warren Weckesser wrote:
>>>>
>>>>
>>>> Just letting you know that I'm not ignoring all the great comments from
>>>> josef, Neil and Bruce about my suggestion for chisquare_contingency.
>>>> Unfortunately, I won't have time to think about all the deeper
>>>> suggestions for another week or so.   For now, I'll just say that I
>>>> agree with josef's and Neil's suggestions for the docstring, and that
>>>> Neil's summary of the function as simply a convenience function that
>>>> calls stats.chisquare with appropriate arguments to perform a test of
>>>> independence on a contingency table is exactly what I had in mind.
>>>>
>>>> Warren
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Hi,
>>>> I looked at how SAS handles n-way tables. What it appears to do is break the
>>>> original table down into a set of 2-way tables and does the analysis on each
>>>> of these. So a 3 by 4 by 5 table is processed as three 2-way tables with the
>>>> results of each 4 by 5 table presented. I do not know how Stata and R
>>>> analysis analyze n-way tables.
>>>>
>>>> Consequently, I rewrote my suggested code (attached) to handle 3 and 4 way
>>>> tables by using recursion. There should be some Python way to do that
>>>> recursion for any number of dimensions. I also added the 1-way table (but
>>>> that has a different hypothesis than the 2-way table) so users can send a
>>>> 1-d table.
>>>>
>>>>
>>>> (very briefly because I don't have much time today)
>>>>
>>>> I think, these are good extensions, but to handle all cases, the
>>>> function is getting too large and would need several options.
>>>>
>>>> On your code and SAS, Z(correct me if my quick reading is wrong)
>>>> You seem to be calculating conditional independence for the last two
>>>> variables conditional on the values of the first variables. I think
>>>> this could be generalized to all pairwise independence tests.
>>>>
>>>> Similar, I'm a bit surprised that SAS uses conditional and not
>>>> marginal independence, I would have thought that the test for marginal
>>>> independence (aggregate out all but 2 variables) would be the more
>>>> common use case.
>>>>
>>>>
>>>> You can argue SAS's formulation relates to how the table is constructed
>>>> because the hypothesis associated with the table is dependent on how the
>>>> user constructs it. For example, the 3-way table A by (B by C) is very
>>>> different from the 3-way table C by (B by A) yet these involve the same
>>>> underlying numbers. If a user did not specify an order then considering all
>>>> possible hypotheses is an option.
>>>>
>>>>         
>>> I don't know the SAS notation, what I thought in analogy to regression
>>> analysis, is that if one variable is considered as endogenous, then
>>> only pairwise tests with this variable need to be included.
>>>
>>>       
>> This is not the same as regression for multiple reasons. Here we are
>> testing independence without any distribution assumption associated with
>> the actual data. (Of course under the normality assumption then these
>> are the same. )
>>
>>     
>>>> Really log-linear models are a better approach to analysis n-way tables
>>>> because these allow you to examine all these different hypotheses.
>>>>
>>>> just some more questions and comments (until I have time to check this)
>>>>
>>>> looking at conditional independence looks similar to linear regression
>>>> models, where the effect of other variables is taken out. However,
>>>> looking at all chisquare tests (conditional on all possible other
>>>> values) runs into the multiple test problem. Is the some kind of
>>>> post-hoc or Bonferroni correction or is there a distribution for eg.
>>>> the max of all chisquare test statistics.
>>>>
>>>>
>>>> Ignoring my views on this, first 'multiple test problems' do not change the
>>>> probability calculation for most approaches to compute the 'raw' p-value as
>>>> the vast majority of the approaches require the 'raw' p-value.
>>>>
>>>> Second, it is very easy to say 'correct for multiple tests' but that is pure
>>>> ignorance when 'what' you are correcting is for is not stated. If you are
>>>> correcting the 'family-wise error rate' then you need to correctly define
>>>> 'family-wise' in this situation especially to address at least one other
>>>> assumption being made.
>>>>
>>>>         
>>> I know nothing about this in the context of contingency tables.
>>>       
>> In a 2-way table there is no need for any correction so it is pointless
>> to say 'correct for multiple tests'. In a 3-way or higher table, as you
>> indicated, is essentially a test of conditional independence as I
>> implemented it. It is also pointless to say 'correct for multiple tests'
>> because you are first assuming conditional independence between say A by
>> B given C=1 and A by B for C=2. So what happens when C=1 is independent
>> of when C=2 so these do belong to different 'families'. Second, there is
>> nothing said about the relation of either A  or B with C - which may be
>> a more critical problem.
>>
>>     
>>> We
>>> recently had the discussion about multiple tests in the context of
>>> post-hoc tests for anova, where I had to read up.
>>>
>>>       
>> I am perhaps too aware of multiple testing and unfortunately these types
>> of discussions go on and on and on. A lot depends on which of many
>> 'schools' of thought you subscribe to. It basically amounts to 'hand
>> waving'  with no solution because these schools are defined by different
>> fundamental  assumptions that can not be challenged. Ultimately none are
>> correct because we never know the true situation - if we did we would
>> not be doing it.
>>     
>
> I think it depends on the hypothesis and the general statistical
> theory is relatively clear, but maybe some people prefer a
> "test-mining" approach.
>
>
>   
>>> In econometrics, there is an extensive literature on this, and some
>>> cases like structural change tests with unknown change points I know
>>> pretty well.
>>>
>>> The main point that I wanted to make is, that multiple change tests
>>> need more attention and at least a warning in the docstring which
>>> (raw) p-values are reported, since it is easy for unwary users to
>>> misinterpret the reported p-values. But hopefully this could be
>>> extended to provide the user with options to do an appropriate
>>> correction.
>>>
>>> Josef
>>>
>>>       
>> This is pointless because you are misunderstanding what is meant by
>> 'multiple test correction'.
>>     
>
> ???
>
>   
>> Placing those kinds of statements in the
>> wrong places also reflects ignorance especially when the correct value
>> maybe given and there is no 'appropriate' correction possible. Further
>> no statement is ever going to protect users from misinterpreting p-values.
>>     
>
> Doing a quick search on the recent literature, it seems there is a lot
> going on in doing proper multiple test correction, additional to more
> traditional tests, that I haven't tried you to really understand or
> where I don't know how well they generalize, e.g. (generalized)
> Cochran-Mantel-Haenszel Chi-Squared Test, Cochran’s Q test.
>
> I only read the abstract of this:
> http://jnci.oxfordjournals.org/cgi/content/abstract/99/2/147
>
> "Twenty-one (50%) of them contained at least one of the following
> three basic flaws: 1) in outcome-related gene finding, an unstated,
> unclear, or inadequate control for multiple testing; 2) ....."
>
> Josef
>
>   
>> Bruce
>>
>>
>>     
>>>       
>>>> with an iterator (numpy mailinglist), my version for the conditional
>>>> independence of the last two variables for all values of the earlier
>>>> variables looks like
>>>>
>>>> for ind in allbut2ax_iterator(table3, axes=(-2,-1)):
>>>>      print chisquare_contingency(table3[ind])
>>>>
>>>> Josef
>>>>
>>>>
>>>>
>>>> A link:
>>>> http://article.gmane.org/gmane.comp.python.numeric.general/38352
>>>>
>>>> I would have to see.
>>>>
>>>> Bruce
>>>>
>>>> Initially, I was thinking just about independence of all variables in
>>>> a 3 or more way table, i.e. P(x,y,z)=P(x)*P(y)*P(z)
>>>>
>>>> My opinion is that these variations of tests would fit better in a
>>>> class where all pairwise conditional, and marginal and joint
>>>> hypotheses can be supplied as methods, or split it up into a group of
>>>> functions.
>>>>
>>>> Thanks,
>>>>
>>>> Josef
>>>>
>>>>
>>>>
>>>> The data used is from two SAS examples and I added a dimension to get a
>>>> 4-way table. I included the SAS values but these are only to 4 decimal
>>>> places for reference.
>>>>
>>>> http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#/documentation/cdl/en/procstat/63104/HTML/default/procstat_freq_sect029.htm
>>>> http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#/documentation/cdl/en/procstat/63104/HTML/default/procstat_freq_sect030.htm
>>>>
>>>> What is missing:
>>>> 1) Docstring and tests but those are dependent what is ultimately decided
>>>> 2) Other test statistics but scipy.stats versions are not very friendly in
>>>> that these do not accept a 2-d array
>>>> 3) A way to do recursion
>>>> 4) Ability to label the levels etc.
>>>> 5) Correct handling of input types.
>>>>
>>>> Bruce
>>>>
>>>> _______________________________________________
>>>> SciPy-Dev mailing list
>>>> SciPy-Dev@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> SciPy-Dev mailing list
>>>> SciPy-Dev@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>>
>>>>
>>>> _______________________________________________
>>>> SciPy-Dev mailing list
>>>> SciPy-Dev@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>>
>>>>
>>>>
>>>>         
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>
>>>       
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>
>>     
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>   



More information about the SciPy-Dev mailing list