[SciPy-Dev] chi-square test for a contingency (R x C) table
Bruce Southey
bsouthey@gmail....
Mon Jun 7 10:00:35 CDT 2010
On 06/07/2010 09:15 AM, josef.pktd@gmail.com wrote:
> On Fri, Jun 4, 2010 at 2:12 PM,<josef.pktd@gmail.com> wrote:
>
>> On Fri, Jun 4, 2010 at 1:08 PM, Bruce Southey<bsouthey@gmail.com> wrote:
>>
>>> On 06/03/2010 08:27 AM, Warren Weckesser wrote:
>>>
>>>> Just letting you know that I'm not ignoring all the great comments from
>>>> josef, Neil and Bruce about my suggestion for chisquare_contingency.
>>>> Unfortunately, I won't have time to think about all the deeper
>>>> suggestions for another week or so. For now, I'll just say that I
>>>> agree with josef's and Neil's suggestions for the docstring, and that
>>>> Neil's summary of the function as simply a convenience function that
>>>> calls stats.chisquare with appropriate arguments to perform a test of
>>>> independence on a contingency table is exactly what I had in mind.
>>>>
>>>> Warren
>>>>
>>>>
>>>>
>>>>
>>> Hi,
>>> I looked at how SAS handles n-way tables. What it appears to do is break the
>>> original table down into a set of 2-way tables and does the analysis on each
>>> of these. So a 3 by 4 by 5 table is processed as three 2-way tables with the
>>> results of each 4 by 5 table presented. I do not know how Stata and R
>>> analysis analyze n-way tables.
>>>
>>> Consequently, I rewrote my suggested code (attached) to handle 3 and 4 way
>>> tables by using recursion. There should be some Python way to do that
>>> recursion for any number of dimensions. I also added the 1-way table (but
>>> that has a different hypothesis than the 2-way table) so users can send a
>>> 1-d table.
>>>
>> (very briefly because I don't have much time today)
>>
>> I think, these are good extensions, but to handle all cases, the
>> function is getting too large and would need several options.
>>
>> On your code and SAS, Z(correct me if my quick reading is wrong)
>> You seem to be calculating conditional independence for the last two
>> variables conditional on the values of the first variables. I think
>> this could be generalized to all pairwise independence tests.
>>
>> Similar, I'm a bit surprised that SAS uses conditional and not
>> marginal independence, I would have thought that the test for marginal
>> independence (aggregate out all but 2 variables) would be the more
>> common use case.
>>
You can argue SAS's formulation relates to how the table is constructed
because the hypothesis associated with the table is dependent on how the
user constructs it. For example, the 3-way table A by (B by C) is very
different from the 3-way table C by (B by A) yet these involve the same
underlying numbers. If a user did not specify an order then considering
all possible hypotheses is an option.
Really log-linear models are a better approach to analysis n-way tables
because these allow you to examine all these different hypotheses.
> just some more questions and comments (until I have time to check this)
>
> looking at conditional independence looks similar to linear regression
> models, where the effect of other variables is taken out. However,
> looking at all chisquare tests (conditional on all possible other
> values) runs into the multiple test problem. Is the some kind of
> post-hoc or Bonferroni correction or is there a distribution for eg.
> the max of all chisquare test statistics.
>
Ignoring my views on this, first 'multiple test problems' do not change
the probability calculation for most approaches to compute the 'raw'
p-value as the vast majority of the approaches require the 'raw' p-value.
Second, it is very easy to say 'correct for multiple tests' but that is
pure ignorance when 'what' you are correcting is for is not stated. If
you are correcting the 'family-wise error rate' then you need to
correctly define 'family-wise' in this situation especially to address
at least one other assumption being made.
> with an iterator (numpy mailinglist), my version for the conditional
> independence of the last two variables for all values of the earlier
> variables looks like
>
> for ind in allbut2ax_iterator(table3, axes=(-2,-1)):
> print chisquare_contingency(table3[ind])
>
> Josef
>
>
A link:
http://article.gmane.org/gmane.comp.python.numeric.general/38352
I would have to see.
Bruce
>> Initially, I was thinking just about independence of all variables in
>> a 3 or more way table, i.e. P(x,y,z)=P(x)*P(y)*P(z)
>>
>> My opinion is that these variations of tests would fit better in a
>> class where all pairwise conditional, and marginal and joint
>> hypotheses can be supplied as methods, or split it up into a group of
>> functions.
>>
>> Thanks,
>>
>> Josef
>>
>>
>>> The data used is from two SAS examples and I added a dimension to get a
>>> 4-way table. I included the SAS values but these are only to 4 decimal
>>> places for reference.
>>>
>>> http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#/documentation/cdl/en/procstat/63104/HTML/default/procstat_freq_sect029.htm
>>> http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#/documentation/cdl/en/procstat/63104/HTML/default/procstat_freq_sect030.htm
>>>
>>> What is missing:
>>> 1) Docstring and tests but those are dependent what is ultimately decided
>>> 2) Other test statistics but scipy.stats versions are not very friendly in
>>> that these do not accept a 2-d array
>>> 3) A way to do recursion
>>> 4) Ability to label the levels etc.
>>> 5) Correct handling of input types.
>>>
>>> Bruce
>>>
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>
>>>
>>>
>>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-dev/attachments/20100607/f25117cc/attachment-0001.html
More information about the SciPy-Dev
mailing list