[SciPy-Dev] chi-square test for a contingency (R x C) table
Mon Jul 12 16:31:34 CDT 2010
On Sat, Jun 19, 2010 at 9:58 AM, <firstname.lastname@example.org> wrote:
> On Sat, Jun 19, 2010 at 9:26 AM, Warren Weckesser
> <email@example.com> wrote:
>> firstname.lastname@example.org wrote:
>>> Forget any merging of the functions.
>>> Statistical functions should also be defined by their purpose, we are
>>> not creating universal f_tests and t_tests. Unless someone is
>>> proposing to merge and unify the various t_tests, ... ?
>>> misquoting: "The user's hypothesis is totally irrelevant ..." ???
>>> Testing for goodness-of-fit is a completely different use case, with
>>> different extensions, e.g. power discrepancy. What if I have a 2d
>>> array and want to check goodness-of-fit along each axis, which might
>>> be useful once group-by extensions to bincount handle more than 1d?
>> So you are anticipating something like this (where `table` is, say, 2D):
>> >>> chisquare_fit(table, axis=-1)
>> Then the result would also be 2D, with the last axis having length 2 and
>> holding the (chi2, p) values?
> I haven't looked at this closely yet, but I would think it would be a
> standard reduce along one axis. Usually we would return one array for the
> test statistic and one array for the p-values (both with one dimension
> fewer than the original).
> chisquare_fit(table, axis=-1) would be equivalent to [chisquare(table[k])
> for k in range(table.shape[0])] for 2d,
> and to apply_along_axis for nd.
> This would be easy to extend, but I don't know how much need there is
> for this currently.
> E.g. if we have a sample by geographic region or group, we might want
> to test whether the distribution is uniform or normal in each group.
> (Continuous distributions would require binning first.)
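The axis-wise reduction described above can be sketched with the `chisquare` function as it exists in current SciPy releases, which accepts an `axis` argument. The proposed name `chisquare_fit` is not used here, and the counts are made-up illustration data. With no expected frequencies given, each row is tested against a uniform distribution.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: 3 groups (rows) by 4 categories (columns).
table = np.array([[12, 9, 11, 8],
                  [30, 5, 3, 2],
                  [10, 10, 10, 10]])

# Reduce along the last axis: one statistic and one p-value per row,
# each tested against a uniform distribution over the 4 categories.
chi2, p = stats.chisquare(table, axis=-1)

# The row-by-row loop from the discussion, for comparison.
loop = [stats.chisquare(table[k]) for k in range(table.shape[0])]
loop_chi2 = np.array([r[0] for r in loop])
loop_p = np.array([r[1] for r in loop])
```

Both result arrays have shape `(3,)`, one dimension fewer than `table`, and they agree with the loop results.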
>>> Or if we extend it to multivariate distributions, then the
>>> default might be uniform for each column (and not independence.)
>>> This is a standard test for distributions, and should not be mixed
>>> with contingency tables.
>> Could you elaborate on this use case? I don't know enough about it to
>> be able to decide if this is something that could be implemented right
>> away, or if it is something that might not happen for years, if ever.
> During this thread, I started to think of contingency tables just as an
> nd discrete distribution, where we can have functions for the
> multivariate distributions, marginal pdf, conditional pdf, ... and
> some tests on it.
> Independence in this case would be just one hypothesis.
> Also, the chisquare independence test conditions on the margin totals,
> this might be the most common case, but not necessarily the only
> chisquare hypothesis we might test. (I'm not too clear on all the
> contingency table stuff.)
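To illustrate what conditioning on the margin totals means, here is a minimal NumPy sketch of the usual independence statistic for a 2-D table, where the expected counts are built only from the row and column totals. The table values and variable names are hypothetical.

```python
import numpy as np

# Hypothetical 2 x 3 contingency table of observed counts.
table = np.array([[10, 20, 30],
                  [20, 20, 20]])

row_totals = table.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = table.sum(axis=0, keepdims=True)   # shape (1, 3)
n = table.sum()

# Expected counts under independence, given the observed margins.
expected = row_totals * col_totals / n

# Pearson chi-square statistic and its degrees of freedom.
chi2 = ((table - expected) ** 2 / expected).sum()
dof = (table.shape[0] - 1) * (table.shape[1] - 1)
```

By construction the expected table has the same row and column totals as the observed one; a different null hypothesis would need a different `expected`.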
> Multivariate distributions are only on my wish list, and it will
> require some work to go beyond pdf, loglike and rvs.
> Multivariate discrete (contingency tables without the statistics),
> multivariate normal and some others would be the first candidates.
> (Copulas would be another multivariate distribution wish.)
> I don't know what the ETA (expected time of arrival) for these would be.
> I like your current implementation, because it's right to the point
> and easy to explain and use. And it looks forward-compatible with
> extended functionality that we might think of.
>>> Contingency tables are a different case, which I never use, and where
>>> I would go with whatever statisticians prefer. But I think going by
>>> null hypothesis makes functions for statistical tests much cleaner
>>> (easier to categorize, explain, find) than one-stop statistics (at
>>> least for functions and not methods in classes) as is the current
>>> tradition of scipy.stats.
>>> "fit" in your function name chisquare_fit is very misleading, because
>>> your function doesn't do any fitting. If a rename is desired, I would
>>> call it chisquare_gof, but I use a similar name for the actual gof
>>> test based on the sample data, with automatic binning.
>>> Fitting the distribution parameters raises other issues which I don't
>>> think should be mixed with the basic chisquare-test.
>> Yes, I agree. I only used "fit" to distinguish it from "ind". I didn't
>> want to use "oneway" and "nway", because those names might lead one to
>> think that "oneway" is the n=1 case of "nway", but it is not.
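The point that the one-way test is not the n=1 case of the n-way test can be checked with a small sketch. It assumes the functions `chisquare` and `chi2_contingency` from current SciPy releases rather than the names discussed in this thread, and the counts are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical 1-D vector of category counts.
counts = np.array([30, 10, 20])

# One-way goodness-of-fit: the null hypothesis is a uniform
# distribution, so the expected count is 20 in each category.
gof_stat, gof_p = stats.chisquare(counts)

# Independence test applied to the same 1-D input: the only margin is
# the data itself, so the expected counts equal the observed counts.
ind_stat, ind_p, dof, expected = stats.chi2_contingency(counts)
```

Here the goodness-of-fit statistic is 10.0, while the independence test on 1-D input is degenerate: zero degrees of freedom and a statistic of 0. The two functions answer different questions even on identical data.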
found when I was looking for something different and I never used it.