[SciPy-User] Fisher exact test, anyone?
josef.pktd@gmai...
Wed Nov 17 07:56:50 CST 2010
On Wed, Nov 17, 2010 at 8:24 AM, Ralf Gommers
<ralf.gommers@googlemail.com> wrote:
>
>
> On Wed, Nov 17, 2010 at 8:38 AM, <josef.pktd@gmail.com> wrote:
>>
>> On Tue, Nov 16, 2010 at 7:10 PM, Ralf Gommers
>> <ralf.gommers@googlemail.com> wrote:
>> >
>> >
>> > On Tue, Nov 16, 2010 at 11:45 PM, Bruce Southey <bsouthey@gmail.com>
>> > wrote:
>> >>
>> >> On 11/16/2010 07:04 AM, Ralf Gommers wrote:
>> >>
>> >> On Mon, Nov 15, 2010 at 12:40 AM, Bruce Southey <bsouthey@gmail.com>
>> >> wrote:
>> >>>
>> >>> On Sat, Nov 13, 2010 at 8:50 PM, <josef.pktd@gmail.com> wrote:
>> >>> > http://projects.scipy.org/scipy/ticket/956 and
>> >>> > http://pypi.python.org/pypi/fisher/ have Fisher's exact
>> >>> > test implementations.
>> >>> >
>> >>> > It would be nice to get a version in for 0.9. I spent a few
>> >>> > unsuccessful days on it earlier this year. But since there are two
>> >>> > new
>> >>> > or corrected versions available, it looks like it just needs testing
>> >>> > and a performance comparison.
>> >>> >
>> >>> > I won't have time for this, so if anyone volunteers for this, scipy
>> >>> > 0.9 should be able to get Fisher's exact.
>> >>>
>> >> https://github.com/rgommers/scipy/tree/fisher-exact
>> >> All tests pass. There's only one usable version (see below) so I didn't
>> >> do
>> >> performance comparison. I'll leave a note on #956 as well, saying we're
>> >> discussing on-list.
>> >>
>> >>> I briefly looked at the code at the PyPI link, but I do not think
>> >>> it is good enough for scipy. Also, I do not like it when people
>> >>> license code as 'BSD' while there is a comment in cfisher.pyx
>> >>> saying '# some of this code is originally from the internet.
>> >>> (thanks)'. Consequently we cannot use that code.
>> >>
>> >> I agree, that's not usable. The plain Python algorithm is also fast
>> >> enough
>> >> that there's no need to bother with Cython.
>> >>>
>> >>> The code with ticket 956 still needs work especially in terms of the
>> >>> input types and probably the API (like having a function that allows
>> >>> the user to select either 1 or 2 tailed tests).
>> >>
>> >> Can you explain what you mean by work on input types? I used
>> >> np.asarray and forced the dtype to be int64. As for the 1-tailed
>> >> test: is it necessary? I note that pearsonr and spearmanr also only
>> >> do 2-tailed.
>> >>
>> >> Cheers,
>> >> Ralf
>> >>
>> >> I have no problem including this if we can agree on the API,
>> >> because everything else is internal and can be fixed by the release
>> >> date. So I would accept a placeholder API that enables a user to
>> >> select in the future which tail(s) is performed.
>> >
>> > It is always possible to add a keyword "tail" later that defaults to
>> > 2-tailed. As long as the behavior doesn't change this is perfectly fine,
>> > and
>> > better than having a placeholder.
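To make the API discussion concrete: under the usual hypergeometric formulation, the one-sided p-values fall straight out of the cdf/sf of one cell, so a "tail" keyword is cheap to add later. A minimal sketch (the function name and signature here are made up for illustration, not the proposed scipy API):

```python
from scipy.stats import hypergeom

def fisher_one_sided(table, tail="less"):
    """One-sided Fisher exact p-value for a 2x2 table (illustrative)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    # With all margins fixed, cell (0, 0) is hypergeometric:
    # population of size n, a + c "successes", sample of size a + b.
    rv = hypergeom(n, a + c, a + b)
    if tail == "less":
        return rv.cdf(a)
    return rv.sf(a - 1)  # tail == "greater"
```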
>> >>
>> >> 1) It just cannot use np.asarray() without checking the input
>> >> first. This is particularly bad for masked arrays.
>> >>
>> > Don't understand this. The input array is not returned, only used
>> > internally. And I can't think of doing anything reasonable with a 2x2
>> > table
>> > with masked values. If that's possible at all, it should probably just
>> > go
>> > into mstats.
>> >
>> >>
>> >> 2) There is no dimension checking because, as I understand it, this
>> >> can only handle a '2 by 2' table. I do not know enough for general
>> >> 'r by c' tables or the 1-d case either.
>> >>
>> > Don't know how easy it would be to add larger tables. I can add
>> > dimension
>> > checking with an informative error message.
>>
>> There is some discussion in the ticket about more than 2-by-2;
>> additions would be nice (and there are some examples on the Matlab
>> File Exchange), but 2-by-2 is the most common case and has an
>> unambiguous definition.
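For the record, the unambiguous 2-by-2 definition can be written down directly with scipy.stats.hypergeom: sum the probabilities of all tables (with the same margins) that are at most as likely as the observed one. A slow reference implementation, not the bisection-based code on the ticket, and the function name is invented:

```python
import numpy as np
from scipy.stats import hypergeom

def fisher_exact_ref(table):
    """Two-sided Fisher exact p-value for a 2x2 table (reference only)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    # Distribution of cell (0, 0) with all margins fixed.
    rv = hypergeom(n, a + c, a + b)
    support = np.arange(max(0, a - d), min(a + b, a + c) + 1)
    probs = rv.pmf(support)
    # Sum the probabilities of all tables at most as likely as the
    # observed one (a small tolerance guards against rounding).
    return probs[probs <= rv.pmf(a) * (1 + 1e-7)].sum()
```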
>>
>>
>> >
>> >>
>> >> 3) The odds-ratio should be removed because it is not part of the test.
>> >> It
>> >> is actually more general than this test.
>> >>
>> > Don't feel strongly about this either way. It comes almost for free, and
>> > R
>> > seems to do the same.
>>
>> Same here; it's kind of traditional to return two things. In this
>> case the odds ratio is not the test statistic, but I don't see that
>> it hurts either.
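The "comes almost for free" part is easy to see: the sample odds ratio is just the cross-product ratio of the table. A one-line sketch (helper name is made up):

```python
def sample_odds_ratio(table):
    """Cross-product ratio of a 2x2 contingency table (illustrative)."""
    (a, b), (c, d) = table
    # Odds of column 1 in row 1 divided by the same odds in row 2:
    # (a/b) / (c/d) = a*d / (b*c).  Undefined when b or c is zero.
    return (a * d) / (b * c)
```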
>>
>> >
>> >> 4) Variable names such as min and max should not shadow Python
>> >> functions.
>> >
>> > Yes, Josef noted this already, will change.
>> >>
>> >> 5) Is there a reference to the algorithm implemented? For example, SPSS
>> >> provides a simple 2 by 2 algorithm:
>> >>
>> >>
>> >> http://support.spss.com/ProductsExt/SPSS/Documentation/Statistics/algorithms/14.0/app05_sig_fisher_exact_test.pdf
>> >
>> > Not supplied, will ask on the ticket and include it.
>>
>> I thought I saw it somewhere (some kind of bisection algorithm), but
>> I can't find the reference anymore; having a reference would be good.
>> Whatever the algorithm is, it's fast, even for larger values.
>>
>> >>
>> >> 6) Why exactly does the dtype need to be int64? That is, is there
>> >> something wrong with the hypergeom function? I just want to
>> >> understand why the precision change is required, because the input
>> >> should enter with sufficient precision.
>> >>
>> > This test:
>> > fisher_exact(np.array([[18000, 80000], [20000, 90000]]))
>> > becomes much slower and gives an overflow warning with int32. int32 is
>> > just
>> > not enough. This is just an implementation detail and does not in any
>> > way
>> > limit the accepted inputs, so I don't see a problem here.
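The int32 limit is easy to hit with counts of that size: a single cross-product of two cells already exceeds 2**31 - 1 (about 2.1e9), while int64 handles it exactly. A small demonstration using the table from above:

```python
import numpy as np

# Counts from the example above; one cross-product already overflows
# int32 but fits easily in int64.
table = np.array([[18000, 80000], [20000, 90000]], dtype=np.int64)
exact = table[0, 1] * table[1, 1]            # 80000 * 90000 = 7200000000
wrapped = np.int32(80000) * np.int32(90000)  # wraps modulo 2**32
print(exact, wrapped)                        # the int32 result is negative
```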
>>
>> For large numbers like this the chi-square test should give almost
>> the same results; it looks pretty "asymptotic" to me. (The usual
>> recommendation for the chi-square test is more than 5 expected
>> observations in each cell.)
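To illustrate, here is the Pearson chi-square approximation for the large table above, computed by hand from the marginal totals (no contingency-table helper function is assumed):

```python
import numpy as np
from scipy.stats import chi2

# Pearson chi-square approximation for the large-count example above.
table = np.array([[18000, 80000], [20000, 90000]], dtype=np.float64)
row = table.sum(axis=1, keepdims=True)
col = table.sum(axis=0, keepdims=True)
expected = row * col / table.sum()    # all expected counts are >> 5 here
stat = ((table - expected) ** 2 / expected).sum()
pvalue = chi2.sf(stat, df=1)          # (2-1)*(2-1) = 1 degree of freedom
print(stat, pvalue)
```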
>> I think the precision is required for some edge cases where the
>> probabilities get very small. The main failing case, which I fought
>> with for several days last winter and didn't manage to fix, had a
>> zero in the first position. I didn't think about increasing the
>> precision.
>>
>> >
>> > Don't know what the behavior should be if a user passes in floats
>> > though?
>> > Just convert to int like now, or raise a warning?
>>
>> I wouldn't do any type checking, and checking that floats are almost
>> integers doesn't sound really necessary either, unless or until users
>> complain. The standard usage should be pretty clear for contingency
>> tables with count data.
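For what it's worth, the only surprise a user could hit with silent conversion is that non-integer floats are truncated toward zero by the int64 cast, as this small check shows:

```python
import numpy as np

# What an int64 cast does to float input: fractional parts are
# silently truncated toward zero.
table = np.asarray([[2.6, 1.0], [1.0, 3.0]]).astype(np.int64)
print(table)  # 2.6 becomes 2
```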
>>
>> Josef
>>
>
> Thanks for checking. https://github.com/rgommers/scipy/commit/b968ba17
> should fix remaining things. Will wait for a few days to see if we get a
> reference to the algorithm. Then will commit.
It looks good to me. I think you can commit whenever you want. We can
also add the reference later through the doc editor.
Thanks for finding the hypergeom imprecision example. We can use it
for the sf ticket.
I think that from a statistical viewpoint this imprecision is pretty
irrelevant; whether a p-value is 1e-9 or 1e-8 wouldn't change the
conclusion about strongly rejecting the null hypothesis.
I usually use assert_almost_equal instead of assert_approx_equal,
which often wouldn't even fail on differences like this. Many
statistical packages report only 3 decimals by default, because
everything else only gives a false sense of precision given the
randomness of the problem.
However, independently of this, discrete cdf and sf need a workover,
but maybe not so urgently if these are the only problems.
Josef
>
> Cheers,
> Ralf
>
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>