[SciPy-Dev] Contingency Table Model

Bruce Southey bsouthey@gmail....
Wed Aug 11 15:04:36 CDT 2010


  On 08/11/2010 02:10 PM, Anthony Scopatz wrote:
>
>
> On Mon, Aug 9, 2010 at 3:35 PM, Bruce Southey <bsouthey@gmail.com> wrote:
>
>
>     On 08/09/2010 02:31 PM, Anthony Scopatz wrote:
>>     Hello All,
>>
>>     I have just opened a ticket
>>     (http://projects.scipy.org/scipy/ticket/1258) that adds a general
>>     contingency table class to the stats package.  This class
>>     includes methods to slice and collapse the table as well as
>>     calculate metrics such as chi-squared and entropy.
>>
>>     This implementation came out of Warren Weckesser and me working
>>     on this over the SciPy 2010 statistics sprint.
>>
>>     Please take a look!  Comments and suggestions are always welcome.
>>     Be Well,
>>     Anthony
>>
>>
>>     _______________________________________________
>>     SciPy-Dev mailing list
>>     SciPy-Dev@scipy.org
>>     http://mail.scipy.org/mailman/listinfo/scipy-dev
>
> Hello All,
>
> I have updated the ticket with new versions of the 
> contingency_table.py and test_contingency_table.py.  I also have a 
> github clone of scipy now, if you just want to grab the changes, 
> http://github.com/scopatz/scipy
>
> Issues addressed in the new version:
>
>    1. Expected tables may now be user-specified.
>    2. Added from_flat() and to_flat() methods.
>    3. Retooled the chi_square() method and removed the
>       chisquare_nway() function.
>    4. All table metric methods (entropy) now add the calculated value
>       to the contingency table's attributes as well as returning the
>       value.
>
> Bruce, Thank you for your concerns.  I'd like to address your points 
> below.
>
>     1) You cannot use numpy's asarray function without checking the
>     input type. You must be aware of at least masked arrays and Matrix
>     inputs as well as new data types.
>
>     2) You cannot force a dtype on the user (line 54) when you
>     could offer optional precision instead.
>
>
> These are handled by now allowing the user to specify their own 
> expected table.  The expected_nway() function that these two points 
> relate to can now be avoided completely, if desired.
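
On point 1), a minimal sketch of why an unchecked np.asarray is a problem, assuming a masked array of counts. The variable names are made up for illustration and are not taken from the ticket.

```python
import numpy as np

# Observed counts with one cell flagged as missing (masked).
counts = np.ma.array([10, 20, 30], mask=[False, True, False])

# np.asarray returns a plain ndarray: the mask is dropped, so the
# masked value silently takes part in every later calculation.
plain = np.asarray(counts)

# np.asanyarray passes ndarray subclasses through, so the mask survives.
kept = np.asanyarray(counts)

plain_total = plain.sum()    # includes the masked 20
masked_total = kept.sum()    # skips the masked 20
```

np.asanyarray is only one way to handle this; np.ma.asarray or an explicit isinstance check on the input would serve the same purpose.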
>
>
>     3) Can you please clarify lines 112-113?
>     "scipy.stats.chisquare -- one-way chi-square test (which is not
>     the same as the n-way test with n=1)."
>     This needs to be a little more clear because the exact same test
>     statistic is being used. In fact the function must give the
>     correct answer with a 1d array.
>
>     4) Related to point 3, lines 72-74 are not correct, see
>     http://en.wikipedia.org/wiki/Pearson's_chi-square_test
>
>
> The chisquared_nway() function has been removed, so 3) and 4) no 
> longer apply.
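
For reference on points 3) and 4), a minimal sketch of the Pearson statistic, sum((observed - expected)**2 / expected), showing that the same formula applies to a 1d array and to a 2d table. The helper name pearson_chi2 and the example numbers are assumptions for illustration only; they do not come from the ticket.

```python
import numpy as np

def pearson_chi2(observed, expected):
    """Pearson chi-square statistic for arrays of any (matching) shape."""
    observed = np.asanyarray(observed, dtype=float)
    expected = np.asanyarray(expected, dtype=float)
    return ((observed - expected) ** 2 / expected).sum()

# 1d case: expected counts are uniform over the categories.
one_way = np.array([16, 18, 16, 14, 12, 12])
uniform = np.full(one_way.shape, one_way.sum() / one_way.size)
stat_1d = pearson_chi2(one_way, uniform)

# 2d case: expected counts are supplied explicitly by the caller.
two_way = np.array([[10, 20], [30, 40]])
two_way_expected = np.array([[12, 18], [28, 42]])
stat_2d = pearson_chi2(two_way, two_way_expected)
```

Only the choice of expected counts differs between the two calls; the statistic itself is computed identically.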
>
>     5) You must allow the user to provide their own expected values
>
> done.
>
>     6) Users need to be able to control the output - really I don't
>     want to see the table of expected values unless requested. Also a
>     user might just want the table of expected values and nothing else.
>
>
> The expected table, much like the probability table or the number of 
> degrees of freedom or the number of dimensions, is not really an 
> output.  Rather it is more of an attribute that helps calculate 
> outputs, like the entropy, mutual information, etc.  Therefore it 
> should always be included in an instance of ContingencyTable.  A user 
> could simply have an array of values that they call a contingency 
> table, but this class provides a tool for easily calculating related 
> metrics (outputs).
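
To make the discussion of the expected table concrete, here is a minimal sketch of how expected counts under independence follow from the margins of a two-way table. It assumes a plain 2-D array of counts; expected_two_way and degrees_of_freedom are hypothetical helper names, not functions from the ticket.

```python
import numpy as np

def expected_two_way(observed):
    """Expected counts for a 2-D table if rows and columns are independent."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    # Each cell is (row total * column total) / grand total.
    return np.outer(row_totals, col_totals) / observed.sum()

def degrees_of_freedom(observed):
    """(rows - 1) * (columns - 1) for a 2-D table."""
    n_rows, n_cols = np.shape(observed)
    return (n_rows - 1) * (n_cols - 1)

observed = [[10, 20], [30, 40]]
expected = expected_two_way(observed)
dof = degrees_of_freedom(observed)
```

A caller who only wants the expected table can stop after expected_two_way, which is the kind of control over output that point 6) asks for.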
>
>     7) You should not need the chi2 function.
>
>
> Now required since chisquared_nway() was removed.
>
>     8) More generally, what is the need for having a ContingencyTable
>     object?
>
>
> Basically, my argument for the need is that contingency tables (or 
> cross tabulations) are expected as standard in any statistics package. 
>  R has them, Matlab has them, SPSS has them, Stata has them, and so 
> on.  I know that when I came to scipy.stats and found that they 
> weren't here already, I was disappointed.
>
> I hope this helps!
>
> Be Well
> Anthony
>
>
>
>     Bruce
>
I am very aware that this type of functionality is available in multiple 
applications, so that was never my concern. But you have addressed 
neither my concerns nor the questions about why it is needed in this 
form.

An important issue is why we need this code when its similarity to 
numpy's histogram functions has already been pointed out. At some stage 
we have to say no to code bloat.
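
As a rough illustration of that similarity, here is a sketch that builds a two-way table of counts from flat observations with np.histogram2d. It assumes the categories are already coded as the integers 0 and 1; the data are invented for the example.

```python
import numpy as np

# Flat observations: one (row category, column category) pair per record.
rows = np.array([0, 0, 1, 1, 1])
cols = np.array([0, 1, 0, 1, 1])

# Bin edges centred on the integer codes 0 and 1.
row_edges = np.arange(3) - 0.5
col_edges = np.arange(3) - 0.5

# histogram2d returns the counts plus the edges it used.
table, _, _ = np.histogram2d(rows, cols, bins=[row_edges, col_edges])

# Margins are just sums over the other axis.
row_margin = table.sum(axis=1)
col_margin = table.sum(axis=0)
```

np.histogramdd extends the same idea to more than two dimensions.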

Note that, as a class, everything must be self-contained: both _margins 
and expected_nway have little point outside your class.
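
A sketch of what I mean by self-contained, with the helpers living on the class rather than at module level. ContingencyTableSketch is a hypothetical stand-in written for this message, not the class attached to the ticket, and it assumes a table with at least two dimensions.

```python
import numpy as np

class ContingencyTableSketch:
    """Hypothetical stand-in, not the class attached to the ticket."""

    def __init__(self, counts):
        self.counts = np.asanyarray(counts, dtype=float)

    def _margins(self):
        """Return one array of marginal totals per axis."""
        ndim = self.counts.ndim
        return [
            self.counts.sum(axis=tuple(a for a in range(ndim) if a != axis))
            for axis in range(ndim)
        ]

    def expected(self):
        """Expected counts if all axes are independent (ndim >= 2)."""
        ndim = self.counts.ndim
        total = self.counts.sum()
        result = np.ones(self.counts.shape)
        for axis, margin in enumerate(self._margins()):
            # Reshape each margin so it broadcasts along its own axis.
            shape = [1] * ndim
            shape[axis] = margin.size
            result = result * np.reshape(margin, shape)
        return result / total ** (ndim - 1)

table = ContingencyTableSketch([[10, 20], [30, 40]])
expected = table.expected()
```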

Bruce