[SciPy-Dev] chi-square test for a contingency (R x C) table

Bruce Southey bsouthey@gmail....
Mon Jun 7 11:45:07 CDT 2010


On 06/07/2010 10:45 AM, josef.pktd@gmail.com wrote:
> On Mon, Jun 7, 2010 at 11:00 AM, Bruce Southey<bsouthey@gmail.com>  wrote:
>    
>> On 06/07/2010 09:15 AM, josef.pktd@gmail.com wrote:
>>
>> On Fri, Jun 4, 2010 at 2:12 PM,<josef.pktd@gmail.com>  wrote:
>>
>>
>> On Fri, Jun 4, 2010 at 1:08 PM, Bruce Southey<bsouthey@gmail.com>  wrote:
>>
>>
>> On 06/03/2010 08:27 AM, Warren Weckesser wrote:
>>
>>
>> Just letting you know that I'm not ignoring all the great comments from
>> josef, Neil and Bruce about my suggestion for chisquare_contingency.
>> Unfortunately, I won't have time to think about all the deeper
>> suggestions for another week or so.   For now, I'll just say that I
>> agree with josef's and Neil's suggestions for the docstring, and that
>> Neil's summary of the function as simply a convenience function that
>> calls stats.chisquare with appropriate arguments to perform a test of
>> independence on a contingency table is exactly what I had in mind.
>>
>> Warren
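A minimal sketch of the kind of convenience function described above, assuming the input is a 2-D array of counts with no empty rows or columns. The name `chisquare_contingency_sketch` is hypothetical and this is not Warren's proposed code; only the NumPy calls and `scipy.stats.chisquare` are real.

```python
import numpy as np
from scipy import stats


def chisquare_contingency_sketch(table):
    """Test of independence on an R x C table via stats.chisquare (illustrative)."""
    obs = np.asarray(table, dtype=float)
    if obs.ndim != 2:
        raise ValueError("expected a 2-D contingency table")
    n = obs.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    r, c = obs.shape
    # stats.chisquare uses k - 1 - ddof degrees of freedom for k cells,
    # so ddof = r + c - 2 yields the usual (r - 1) * (c - 1).
    chi2, p = stats.chisquare(obs.ravel(), expected.ravel(), ddof=r + c - 2)
    return chi2, p, (r - 1) * (c - 1), expected
```

The only non-obvious step is the `ddof` argument: without it, `stats.chisquare` would treat the R*C cells as a single 1-way table.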
>>
>>
>>
>>
>>
>> Hi,
>> I looked at how SAS handles n-way tables. What it appears to do is break the
>> original table down into a set of 2-way tables and does the analysis on each
>> of these. So a 3 by 4 by 5 table is processed as three 2-way tables with the
>> results of each 4 by 5 table presented. I do not know how Stata and R
>> analyze n-way tables.
>>
>> Consequently, I rewrote my suggested code (attached) to handle 3 and 4 way
>> tables by using recursion. There should be some Python way to do that
>> recursion for any number of dimensions. I also added the 1-way table (but
>> that has a different hypothesis than the 2-way table) so users can send a
>> 1-d table.
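One way to avoid explicit recursion for any number of dimensions is to flatten all leading axes and loop over the resulting stack of 2-way tables. This is a sketch under assumptions, not the attached code: the function name is hypothetical, and it relies on `scipy.stats.chi2_contingency`, which postdates this thread.

```python
import numpy as np
from scipy.stats import chi2_contingency


def conditional_two_way_tests(table):
    """Chi-square test on each 2-way table formed by the last two axes.

    Returns a list of (chi2, p, dof) tuples, one per combination of the
    leading axes, in C (row-major) order.
    """
    obs = np.asarray(table)
    if obs.ndim < 2:
        raise ValueError("need at least a 2-way table")
    # Collapse all leading axes into one, keeping the last two intact.
    stacked = obs.reshape((-1,) + obs.shape[-2:])
    results = []
    for sub in stacked:
        # correction=False disables the Yates correction for 2 x 2 tables.
        chi2, p, dof, _ = chi2_contingency(sub, correction=False)
        results.append((chi2, p, dof))
    return results
```

With a 3 by 4 by 5 input this yields three results, one per 4 by 5 table, matching the SAS behaviour described above.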
>>
>>
>> (very briefly because I don't have much time today)
>>
>> I think, these are good extensions, but to handle all cases, the
>> function is getting too large and would need several options.
>>
>> On your code and SAS (correct me if my quick reading is wrong):
>> You seem to be calculating conditional independence for the last two
>> variables conditional on the values of the first variables. I think
>> this could be generalized to all pairwise independence tests.
>>
>> Similarly, I'm a bit surprised that SAS uses conditional and not
>> marginal independence; I would have thought that the test for marginal
>> independence (aggregate out all but 2 variables) would be the more
>> common use case.
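The marginal test mentioned here (aggregate out all but two variables) can be sketched as follows. The function name and the `axes` parameter are hypothetical, and `scipy.stats.chi2_contingency` is assumed to be available (it postdates this thread).

```python
import numpy as np
from scipy.stats import chi2_contingency


def marginal_two_way_test(table, axes=(-2, -1)):
    """Sum over every axis except the two in `axes`, then test independence."""
    obs = np.asarray(table)
    keep = {ax % obs.ndim for ax in axes}
    if len(keep) != 2:
        raise ValueError("axes must name two distinct dimensions")
    drop = tuple(ax for ax in range(obs.ndim) if ax not in keep)
    # The marginal 2-way table of the two variables of interest.
    marginal = obs.sum(axis=drop)
    chi2, p, dof, _ = chi2_contingency(marginal, correction=False)
    return chi2, p, dof
```

Unlike the conditional version, this produces a single test per pair of variables, at the cost of ignoring the other variables.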
>>
>>
>> You can argue SAS's formulation relates to how the table is constructed
>> because the hypothesis associated with the table is dependent on how the
>> user constructs it. For example, the 3-way table A by (B by C) is very
>> different from the 3-way table C by (B by A) yet these involve the same
>> underlying numbers. If a user did not specify an order then considering all
>> possible hypotheses is an option.
>>      
> I don't know the SAS notation; what I thought, in analogy to regression
> analysis, is that if one variable is considered as endogenous, then
> only pairwise tests with this variable need to be included.
>    
This is not the same as regression for multiple reasons. Here we are 
testing independence without any distribution assumption associated with 
the actual data. (Of course, under the normality assumption these 
are the same.)

>    
>> Really log-linear models are a better approach to analyzing n-way tables
>> because these allow you to examine all these different hypotheses.
>>
>> just some more questions and comments (until I have time to check this)
>>
>> looking at conditional independence looks similar to linear regression
>> models, where the effect of other variables is taken out. However,
>> looking at all chisquare tests (conditional on all possible other
>> values) runs into the multiple test problem. Is there some kind of
>> post-hoc or Bonferroni correction, or is there a distribution for, e.g.,
>> the max of all chisquare test statistics?
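For reference, a Bonferroni adjustment applied to a set of raw p-values computes the following. This is only a sketch of the arithmetic; the function name is hypothetical, and whether such an adjustment is appropriate for these tables is exactly what is disputed in this thread.

```python
import numpy as np


def bonferroni_adjust(pvalues):
    """Multiply each raw p-value by the number of tests, capped at 1."""
    p = np.asarray(pvalues, dtype=float)
    return np.minimum(p * p.size, 1.0)
```

The raw p-values are left untouched, so a function could report both.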
>>
>>
>> Ignoring my views on this, first 'multiple test problems' do not change the
>> probability calculation for most approaches to compute the 'raw' p-value as
>> the vast majority of the approaches require the 'raw' p-value.
>>
>> Second, it is very easy to say 'correct for multiple tests' but that is pure
>> ignorance when 'what' you are correcting for is not stated. If you are
>> correcting the 'family-wise error rate' then you need to correctly define
>> 'family-wise' in this situation especially to address at least one other
>> assumption being made.
>>      
> I know nothing about this in the context of contingency tables.
In a 2-way table there is no need for any correction, so it is pointless 
to say 'correct for multiple tests'. For a 3-way or higher table, the 
analysis as I implemented it is, as you indicated, essentially a test of 
conditional independence. It is also pointless to say 'correct for 
multiple tests' there because, first, you are assuming conditional 
independence between, say, A by B given C=1 and A by B given C=2. So 
what happens when C=1 is independent of what happens when C=2, and these 
belong to different 'families'. Second, nothing is said about the 
relation of either A or B with C, which may be a more critical problem.

> We
> recently had the discussion about multiple tests in the context of
> post-hoc tests for anova, where I had to read up.
>    
I am perhaps too aware of multiple testing and unfortunately these types 
of discussions go on and on and on. A lot depends on which of many 
'schools' of thought you subscribe to. It basically amounts to 'hand 
waving' with no solution because these schools are defined by different 
fundamental assumptions that cannot be challenged. Ultimately none are 
correct because we never know the true situation - if we did we would 
not be doing it.
> In econometrics, there is an extensive literature on this, and some
> cases like structural change tests with unknown change points I know
> pretty well.
>
> The main point that I wanted to make is, that multiple change tests
> need more attention and at least a warning in the docstring which
> (raw) p-values are reported, since it is easy for unwary users to
> misinterpret the reported p-values. But hopefully this could be
> extended to provide the user with options to do an appropriate
> correction.
>
> Josef
>    
This is pointless because you are misunderstanding what is meant by 
'multiple test correction'. Placing those kinds of statements in the 
wrong places also reflects ignorance, especially when the correct value 
may be given and there is no 'appropriate' correction possible. Further, 
no statement is ever going to protect users from misinterpreting p-values.

Bruce


>
>    
>> with an iterator (numpy mailinglist), my version for the conditional
>> independence of the last two variables for all values of the earlier
>> variables looks like
>>
>> for ind in allbut2ax_iterator(table3, axes=(-2,-1)):
>>      print chisquare_contingency(table3[ind])
>>
>> Josef
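The helper `allbut2ax_iterator` is not shown in this message. A hypothetical stand-in with the same calling convention, written for Python 3, could look like the sketch below; it is an assumption about what such an iterator does, not the code from the linked post.

```python
import itertools

import numpy as np


def allbut2ax_iterator_sketch(arr, axes=(-2, -1)):
    """Yield index tuples that select every 2-D slice spanning `axes`.

    Each yielded tuple holds slice(None) at the two kept axes and an
    integer at every other axis, so arr[ind] is always 2-D.
    """
    ndim = arr.ndim
    keep = {ax % ndim for ax in axes}
    if len(keep) != 2:
        raise ValueError("axes must name two distinct dimensions")
    ranges = [
        [slice(None)] if ax in keep else range(arr.shape[ax])
        for ax in range(ndim)
    ]
    for ind in itertools.product(*ranges):
        yield ind
```

Because the kept axes are a parameter, the same loop covers any pair of variables, not only the last two.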
>>
>>
>>
>> A link:
>> http://article.gmane.org/gmane.comp.python.numeric.general/38352
>>
>> I would have to see.
>>
>> Bruce
>>
>> Initially, I was thinking just about independence of all variables in
>> a 3 or more way table, i.e. P(x,y,z)=P(x)*P(y)*P(z)
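Under this joint hypothesis the expected count in each cell is the product of the one-way margins divided by n^(d-1), where d is the number of variables. A sketch under that assumption follows; the function name is hypothetical, and the p-value uses the standard chi-square approximation.

```python
from functools import reduce

import numpy as np
from scipy.stats import chi2


def mutual_independence_test(table):
    """Pearson chi-square test of P(x,y,z,...) = P(x)*P(y)*P(z)*..."""
    obs = np.asarray(table, dtype=float)
    d = obs.ndim
    if d < 2:
        raise ValueError("need at least a 2-way table")
    n = obs.sum()
    # One-way margin for each variable: sum over all the other axes.
    margins = [
        obs.sum(axis=tuple(j for j in range(d) if j != i)) for i in range(d)
    ]
    # Outer product of the margins, scaled so that it sums to n.
    expected = reduce(np.multiply.outer, margins) / n ** (d - 1)
    stat = ((obs - expected) ** 2 / expected).sum()
    # Cells minus 1, minus the free parameters of each margin.
    dof = obs.size - 1 - sum(s - 1 for s in obs.shape)
    return stat, chi2.sf(stat, dof), dof
```

For a 2-way table this reduces to the usual test with (R-1)*(C-1) degrees of freedom.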
>>
>> My opinion is that these variations of tests would fit better in a
>> class where all pairwise conditional, and marginal and joint
>> hypotheses can be supplied as methods, or split it up into a group of
>> functions.
>>
>> Thanks,
>>
>> Josef
>>
>>
>>
>> The data used is from two SAS examples and I added a dimension to get a
>> 4-way table. I included the SAS values but these are only to 4 decimal
>> places for reference.
>>
>> http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#/documentation/cdl/en/procstat/63104/HTML/default/procstat_freq_sect029.htm
>> http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#/documentation/cdl/en/procstat/63104/HTML/default/procstat_freq_sect030.htm
>>
>> What is missing:
>> 1) Docstring and tests, but those depend on what is ultimately decided
>> 2) Other test statistics but scipy.stats versions are not very friendly in
>> that these do not accept a 2-d array
>> 3) A way to do recursion
>> 4) Ability to label the levels etc.
>> 5) Correct handling of input types.
>>
>> Bruce
>>
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev