[SciPy-User] Fisher exact test, anyone?

Bruce Southey bsouthey@gmail....
Fri Nov 19 11:35:58 CST 2010


On Wed, Nov 17, 2010 at 7:24 AM, Ralf Gommers
<ralf.gommers@googlemail.com> wrote:
>
>
> On Wed, Nov 17, 2010 at 8:38 AM, <josef.pktd@gmail.com> wrote:
>>
>> On Tue, Nov 16, 2010 at 7:10 PM, Ralf Gommers
>> <ralf.gommers@googlemail.com> wrote:
>> >
>> >
>> > On Tue, Nov 16, 2010 at 11:45 PM, Bruce Southey <bsouthey@gmail.com>
>> > wrote:
>> >>
>> >> On 11/16/2010 07:04 AM, Ralf Gommers wrote:
>> >>
>> >> On Mon, Nov 15, 2010 at 12:40 AM, Bruce Southey <bsouthey@gmail.com>
>> >> wrote:
>> >>>
>> >>> On Sat, Nov 13, 2010 at 8:50 PM,  <josef.pktd@gmail.com> wrote:
>> >>> > http://projects.scipy.org/scipy/ticket/956 and
>> >>> > http://pypi.python.org/pypi/fisher/ have Fisher's exact
>> >>> > testimplementations.
>> >>> >
>> >>> > It would be nice to get a version in for 0.9. I spent a few
>> >>> > unsuccessful days on it earlier this year. But since there are two
>> >>> > new
>> >>> > or corrected versions available, it looks like it just needs testing
>> >>> > and a performance comparison.
>> >>> >
>> >>> > I won't have time for this, so if anyone volunteers for this, scipy
>> >>> > 0.9 should be able to get Fisher's exact.
>> >>>
>> >> https://github.com/rgommers/scipy/tree/fisher-exact
>> >> All tests pass. There's only one usable version (see below) so I didn't
>> >> do
>> >> performance comparison. I'll leave a note on #956 as well, saying we're
>> >> discussing on-list.
>> >>
>> >>> I briefly looked at the code at pypi link but I do not think it is
>> >>> good enough for scipy. Also, I do not like when people license code as
>> >>> 'BSD' and there is a comment in cfisher.pyx  '# some of this code is
>> >>> originally from the internet. (thanks)'. Consequently we can not use
>> >>> that code.
>> >>
>> >> I agree, that's not usable. The plain Python algorithm is also fast
>> >> enough
>> >> that there's no need to bother with Cython.
>> >>>
>> >>> The code with ticket 956 still needs work especially in terms of the
>> >>> input types and probably the API (like having a function that allows
>> >>> the user to select either 1 or 2 tailed tests).
>> >>
>> >> Can you explain what you mean by work on input types? I used np.asarray
>> >> and forced dtype to be int64. For the 1-tailed test, is it necessary? I
>> >> note
>> >> that pearsonr and spearmanr also only do 2-tailed.
>> >>
>> >> Cheers,
>> >> Ralf
>> >>
>> >> I have no problem including this if we can agree on the API because
>> >> everything else is internal that can be fixed by release date. So I
>> >> would
>> >> accept a place holder API that enable a user in the future to select
>> >> which
>> >> tail(s) is performed.
>> >
>> > It is always possible to add a keyword "tail" later that defaults to
>> > 2-tailed. As long as the behavior doesn't change this is perfectly fine,
>> > and
>> > better than having a placeholder.
>> >>
>> >> 1) It just can not use np.asarray() without checking the input first.
>> >> This
>> >> is particularly bad for masked arrays.
>> >>
>> > Don't understand this. The input array is not returned, only used
>> > internally. And I can't think of doing anything reasonable with a 2x2
>> > table
>> > with masked values. If that's possible at all, it should probably just
>> > go
>> > into mstats.
>> >
>> >>
>> >> 2) There are no dimension checking because, as I understand it, this
>> >> can
>> >> only handle a '2 by 2' table. I do not know enough for general 'r by c'
>> >> tables or the 1-d case either.
>> >>
>> > Don't know how easy it would be to add larger tables. I can add
>> > dimension
>> > checking with an informative error message.
>>
>> There is some discussion in the ticket about more than 2by2,
>> additions would be nice (and there are some examples on the matlab
>> fileexchange), but 2by2 is the most common case and has an unambiguous
>> definition.
>>
>>
>> >
>> >>
>> >> 3) The odds-ratio should be removed because it is not part of the test.
>> >> It
>> >> is actually more general than this test.
>> >>
>> > Don't feel strongly about this either way. It comes almost for free, and
>> > R
>> > seems to do the same.
>>
>> same here, it's kind of traditional to return two things, but in this
>> case the odds ratio is not the test statistic, but I don't see that it
>> hurts either
>>
>> >
>> >> 4) Variable names such as min and max should not shadow Python
>> >> functions.
>> >
>> > Yes, Josef noted this already, will change.
>> >>
>> >> 5) Is there a reference to the algorithm implemented? For example, SPSS
>> >> provides a simple 2 by 2 algorithm:
>> >>
>> >>
>> >> http://support.spss.com/ProductsExt/SPSS/Documentation/Statistics/algorithms/14.0/app05_sig_fisher_exact_test.pdf
>> >
>> > Not supplied, will ask on the ticket and include it.
>>
>> I thought, I saw it somewhere, but don't find the reference anymore,
>> some kind of bisection algorithm, but having a reference would be
>> good.
>> Whatever the algorithm is, it's fast, even for larger values.
>>
>> >>
>> >> 6) Why exactly does the dtype need to int64? That is, is there
>> >> something
>> >> wrong with hypergeom function? I just want to understand why the
>> >> precision
>> >> change is required because the input should enter with sufficient
>> >> precision.
>> >>
>> > This test:
>> > fisher_exact(np.array([[18000, 80000], [20000, 90000]]))
>> > becomes much slower and gives an overflow warning with int32. int32 is
>> > just
>> > not enough. This is just an implementation detail and does not in any
>> > way
>> > limit the accepted inputs, so I don't see a problem here.
>>
>> for large numbers like this the chisquare test should give almost the
>> same results, it looks pretty "asymptotic" to me. (the usual
>> recommendation for the chisquare is more than 5 expected observations
>> in each cell)
>> I think the precision is required for some edge cases when
>> probabilities get very small. The main failing case, I was fighting
>> with for several days last winter, and didn't manage to fix had a zero
>> at the first position. I didn't think about increasing the precision.
>>
>> >
>> > Don't know what the behavior should be if a user passes in floats
>> > though?
>> > Just convert to int like now, or raise a warning?
>>
>> I wouldn't do any type checking, and checking that floats are almost
>> integers doesn't sound really necessary either, unless or until users
>> complain. The standard usage should be pretty clear for contingency
>> tables with count data.
>>
>> Josef
>>
>
> Thanks for checking. https://github.com/rgommers/scipy/commit/b968ba17
> should fix remaining things. Will wait for a few days to see if we get a
> reference to the algorithm. Then will commit.
>
> Cheers,
> Ralf
>
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>

Sorry but I don't agree. But I said I do not have time to address this
and I really do not like adding the code as it is.

Bruce


More information about the SciPy-User mailing list