[SciPy-Dev] chi-square test for a contingency (R x C) table
josef.pktd@gmai...
Mon Jun 7 10:45:06 CDT 2010
On Mon, Jun 7, 2010 at 11:00 AM, Bruce Southey <bsouthey@gmail.com> wrote:
> On 06/07/2010 09:15 AM, josef.pktd@gmail.com wrote:
>
> On Fri, Jun 4, 2010 at 2:12 PM, <josef.pktd@gmail.com> wrote:
>
>
> On Fri, Jun 4, 2010 at 1:08 PM, Bruce Southey <bsouthey@gmail.com> wrote:
>
>
> On 06/03/2010 08:27 AM, Warren Weckesser wrote:
>
>
> Just letting you know that I'm not ignoring all the great comments from
> josef, Neil and Bruce about my suggestion for chisquare_contingency.
> Unfortunately, I won't have time to think about all the deeper
> suggestions for another week or so. For now, I'll just say that I
> agree with josef's and Neil's suggestions for the docstring, and that
> Neil's summary of the function as simply a convenience function that
> calls stats.chisquare with appropriate arguments to perform a test of
> independence on a contingency table is exactly what I had in mind.
>
> Warren
>
>
>
>
>
> Hi,
> I looked at how SAS handles n-way tables. It appears to break the
> original table down into a set of 2-way tables and analyze each of
> these. So a 3 by 4 by 5 table is processed as three 2-way tables, with
> the results of each 4 by 5 table presented. I do not know how Stata and
> R analyze n-way tables.
>
> Consequently, I rewrote my suggested code (attached) to handle 3 and 4 way
> tables by using recursion. There should be some Python way to do that
> recursion for any number of dimensions. I also added the 1-way table (but
> that has a different hypothesis than the 2-way table) so users can send a
> 1-d table.
>
>
> (very briefly because I don't have much time today)
>
> I think, these are good extensions, but to handle all cases, the
> function is getting too large and would need several options.
>
> On your code and SAS (correct me if my quick reading is wrong):
> You seem to be calculating conditional independence for the last two
> variables conditional on the values of the first variables. I think
> this could be generalized to all pairwise independence tests.
>
> Similarly, I'm a bit surprised that SAS uses conditional and not
> marginal independence, I would have thought that the test for marginal
> independence (aggregate out all but 2 variables) would be the more
> common use case.
>
>
> You can argue SAS's formulation relates to how the table is constructed
> because the hypothesis associated with the table is dependent on how the
> user constructs it. For example, the 3-way table A by (B by C) is very
> different from the 3-way table C by (B by A) yet these involve the same
> underlying numbers. If a user did not specify an order then considering all
> possible hypotheses is an option.
I don't know the SAS notation. What I had in mind, in analogy to
regression analysis, is that if one variable is considered endogenous,
then only pairwise tests with this variable need to be included.
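
A minimal sketch of that idea, assuming a current SciPy where
scipy.stats.chi2_contingency is available. The helper name
pairwise_marginal_tests and the counts are made up for illustration; each
pair is tested on the marginal table, with all other variables summed out.

```python
import numpy as np
from scipy.stats import chi2_contingency


def pairwise_marginal_tests(table, response_axis):
    """Test the response axis against every other axis, one pair at a time.

    For each pair, all remaining axes are summed out (marginal table).
    Returns a dict: other axis index -> (statistic, raw p-value, dof).
    """
    table = np.asarray(table)
    response_axis = response_axis % table.ndim
    results = {}
    for other in range(table.ndim):
        if other == response_axis:
            continue
        # Sum over every axis except the response and the current partner.
        drop = tuple(ax for ax in range(table.ndim)
                     if ax not in (response_axis, other))
        marginal = table.sum(axis=drop)
        stat, pvalue, dof, _ = chi2_contingency(marginal, correction=False)
        results[other] = (stat, pvalue, dof)
    return results


# Made-up 2 x 3 x 4 counts, for illustration only.
rng = np.random.default_rng(0)
table3 = rng.integers(5, 30, size=(2, 3, 4))
results = pairwise_marginal_tests(table3, response_axis=0)
for other, (stat, pvalue, dof) in results.items():
    print(other, stat, pvalue, dof)
```

The p-values returned here are raw, one per pair, with no adjustment for
the number of tests.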
>
> Really, log-linear models are a better approach to analyzing n-way tables
> because these allow you to examine all these different hypotheses.
>
> just some more questions and comments (until I have time to check this)
>
> looking at conditional independence looks similar to linear regression
> models, where the effect of other variables is taken out. However,
> looking at all chisquare tests (conditional on all possible other
> values) runs into the multiple test problem. Is there some kind of
> post-hoc or Bonferroni correction, or is there a distribution for, e.g.,
> the max of all chisquare test statistics?
>
>
> Ignoring my views on this, first, 'multiple test problems' do not change
> how the 'raw' p-value is computed, because the vast majority of the
> correction approaches start from the 'raw' p-value.
>
> Second, it is very easy to say 'correct for multiple tests' but that is pure
> ignorance when what you are correcting for is not stated. If you are
> correcting the 'family-wise error rate' then you need to correctly define
> 'family-wise' in this situation especially to address at least one other
> assumption being made.
I know nothing about this in the context of contingency tables. We
recently had the discussion about multiple tests in the context of
post-hoc tests for anova, where I had to read up.
In econometrics, there is an extensive literature on this, and some
cases like structural change tests with unknown change points I know
pretty well.
The main point that I wanted to make is that multiple tests
need more attention, and at least a warning in the docstring about which
(raw) p-values are reported, since it is easy for unwary users to
misinterpret the reported p-values. But hopefully this could be
extended to provide the user with options to do an appropriate
correction.
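
As a rough sketch of what such an option might do, here are the Bonferroni
and Holm adjustments applied to a set of raw p-values. The function names
and the example p-values are made up, and the "family" is simply assumed
to be all the tests passed in, which is exactly the choice that would need
to be stated.

```python
import numpy as np


def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    p = np.asarray(pvalues, dtype=float)
    return np.minimum(p * p.size, 1.0)


def holm(pvalues):
    """Holm step-down adjusted p-values (never larger than Bonferroni)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Smallest p-value is multiplied by m, the next by m - 1, and so on.
    scaled = p[order] * (m - np.arange(m))
    # Enforce monotonicity along the sorted order, then cap at 1.
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted


raw = [0.01, 0.04, 0.03]  # hypothetical raw p-values from three tests
print(bonferroni(raw))
print(holm(raw))
```

Both functions take the raw p-values unchanged as input, so reporting raw
values and offering a correction are compatible.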
Josef
>
> with an iterator (numpy mailinglist), my version for the conditional
> independence of the last two variables for all values of the earlier
> variables looks like
>
> for ind in allbut2ax_iterator(table3, axes=(-2,-1)):
>     print(chisquare_contingency(table3[ind]))
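
For a self-contained variant of this loop, here is a sketch that swaps in
np.ndindex for the iterator from the numpy list and a current SciPy's
chi2_contingency for the proposed chisquare_contingency. Both substitutions
are assumptions, and the counts are made up.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 3 x 4 x 5 counts, for illustration only.
rng = np.random.default_rng(1)
table3 = rng.integers(5, 30, size=(3, 4, 5))

conditional = {}
# np.ndindex walks every index combination of the leading axes,
# so table3[ind] is the 2-way table of the last two axes.
for ind in np.ndindex(*table3.shape[:-2]):
    stat, pvalue, dof, _ = chi2_contingency(table3[ind], correction=False)
    conditional[ind] = (stat, pvalue, dof)
    print(ind, stat, pvalue, dof)
```

This gives one raw test per level of the first variable, i.e. three 4 by 5
tables for a 3 by 4 by 5 input.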
>
> Josef
>
>
>
> A link:
> http://article.gmane.org/gmane.comp.python.numeric.general/38352
>
> I would have to see.
>
> Bruce
>
> Initially, I was thinking just about independence of all variables in
> a 3 or more way table, i.e. P(x,y,z)=P(x)*P(y)*P(z)
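
A sketch of that joint test, assuming the usual Pearson statistic with
expected counts built from the product of the one-way marginal proportions.
The function name mutual_independence_test and the counts are made up, and
the table is assumed to have at least two dimensions and no empty margins.

```python
import numpy as np
from scipy.stats import chi2


def mutual_independence_test(table):
    """Pearson chi-square test of P(x,y,z,...) = P(x)*P(y)*P(z)*... ."""
    observed = np.asarray(table, dtype=float)
    total = observed.sum()
    expected = np.full(observed.shape, total)
    for axis in range(observed.ndim):
        # Marginal proportions of this axis, shaped to broadcast along it.
        other_axes = tuple(a for a in range(observed.ndim) if a != axis)
        margin = observed.sum(axis=other_axes) / total
        shape = [1] * observed.ndim
        shape[axis] = -1
        expected = expected * margin.reshape(shape)
    stat = ((observed - expected) ** 2 / expected).sum()
    # Cells minus 1, minus the (n_k - 1) free marginal proportions per axis.
    dof = observed.size - 1 - sum(n - 1 for n in observed.shape)
    return stat, chi2.sf(stat, dof), dof


# Made-up 2 x 3 x 4 counts, for illustration only.
rng = np.random.default_rng(2)
table3 = rng.integers(5, 30, size=(2, 3, 4))
stat, pvalue, dof = mutual_independence_test(table3)
print(stat, pvalue, dof)
```

For a 2-way table this reduces to the ordinary test of independence with
(R - 1) * (C - 1) degrees of freedom.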
>
> My opinion is that these variations of tests would fit better in a
> class where all pairwise conditional, and marginal and joint
> hypotheses can be supplied as methods, or split it up into a group of
> functions.
>
> Thanks,
>
> Josef
>
>
>
> The data used is from two SAS examples and I added a dimension to get a
> 4-way table. I included the SAS values but these are only to 4 decimal
> places for reference.
>
> http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#/documentation/cdl/en/procstat/63104/HTML/default/procstat_freq_sect029.htm
> http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#/documentation/cdl/en/procstat/63104/HTML/default/procstat_freq_sect030.htm
>
> What is missing:
> 1) Docstring and tests, but those depend on what is ultimately decided
> 2) Other test statistics but scipy.stats versions are not very friendly in
> that these do not accept a 2-d array
> 3) A way to do recursion
> 4) Ability to label the levels etc.
> 5) Correct handling of input types.
>
> Bruce
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev