[SciPy-User] Fisher exact test, anyone?

Bruce Southey bsouthey@gmail....
Sun Nov 21 13:03:51 CST 2010


On Sun, Nov 21, 2010 at 2:23 AM, Ralf Gommers
<ralf.gommers@googlemail.com> wrote:
>
>
> On Sat, Nov 20, 2010 at 1:35 AM, Bruce Southey <bsouthey@gmail.com> wrote:
>>
>> On Wed, Nov 17, 2010 at 7:24 AM, Ralf Gommers
>> <ralf.gommers@googlemail.com> wrote:
>> >
>> >
>> > On Wed, Nov 17, 2010 at 8:38 AM, <josef.pktd@gmail.com> wrote:
>> >>
>> >> On Tue, Nov 16, 2010 at 7:10 PM, Ralf Gommers
>> >> <ralf.gommers@googlemail.com> wrote:
>> >> >
>> >> >
>> >> > On Tue, Nov 16, 2010 at 11:45 PM, Bruce Southey <bsouthey@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> I have no problem including this if we can agree on the API because
>> >> >> everything else is internal that can be fixed by release date. So I
>> >> >> would
>> >> >> accept a placeholder API that enables a user in the future to select
>> >> >> which
>> >> >> tail(s) is performed.
>> >> >
>> >> > It is always possible to add a keyword "tail" later that defaults to
>> >> > 2-tailed. As long as the behavior doesn't change this is perfectly
>> >> > fine,
>> >> > and
>> >> > better than having a placeholder.
>> >> >>
>> >> >> 1) It just can not use np.asarray() without checking the input
>> >> >> first.
>> >> >> This
>> >> >> is particularly bad for masked arrays.
>> >> >>
>> >> > Don't understand this. The input array is not returned, only used
>> >> > internally. And I can't think of doing anything reasonable with a 2x2
>> >> > table
>> >> > with masked values. If that's possible at all, it should probably
>> >> > just
>> >> > go
>> >> > into mstats.
>> >> >
>> >> >>
>> >> >> 2) There is no dimension checking because, as I understand it, this
>> >> >> can
>> >> >> only handle a '2 by 2' table. I do not know enough for general 'r by
>> >> >> c'
>> >> >> tables or the 1-d case either.
>> >> >>
>> >> > Don't know how easy it would be to add larger tables. I can add
>> >> > dimension
>> >> > checking with an informative error message.
>> >>
>> >> There is some discussion in the ticket about more than 2by2,
>> >> additions would be nice (and there are some examples on the matlab
>> >> fileexchange), but 2by2 is the most common case and has an unambiguous
>> >> definition.
>> >>
>> >>
>> >> >
>> >> >>
>> >> >> 3) The odds-ratio should be removed because it is not part of the
>> >> >> test.
>> >> >> It
>> >> >> is actually more general than this test.
>> >> >>
>> >> > Don't feel strongly about this either way. It comes almost for free,
>> >> > and
>> >> > R
>> >> > seems to do the same.
>> >>
>> >> same here, it's kind of traditional to return two things, but in this
>> >> case the odds ratio is not the test statistic, but I don't see that it
>> >> hurts either
>> >>
>> >> >
>> >> >> 4) Variable names such as min and max should not shadow Python
>> >> >> functions.
>> >> >
>> >> > Yes, Josef noted this already, will change.
>> >> >>
>> >> >> 5) Is there a reference to the algorithm implemented? For example,
>> >> >> SPSS
>> >> >> provides a simple 2 by 2 algorithm:
>> >> >>
>> >> >>
>> >> >>
>> >> >> http://support.spss.com/ProductsExt/SPSS/Documentation/Statistics/algorithms/14.0/app05_sig_fisher_exact_test.pdf
>> >> >
>> >> > Not supplied, will ask on the ticket and include it.
>> >>
>> >> I thought, I saw it somewhere, but don't find the reference anymore,
>> >> some kind of bisection algorithm, but having a reference would be
>> >> good.
>> >> Whatever the algorithm is, it's fast, even for larger values.
>> >>
>> >> >>
>> >> >> 6) Why exactly does the dtype need to be int64? That is, is there
>> >> >> something
>> >> >> wrong with hypergeom function? I just want to understand why the
>> >> >> precision
>> >> >> change is required because the input should enter with sufficient
>> >> >> precision.
>> >> >>
>> >> > This test:
>> >> > fisher_exact(np.array([[18000, 80000], [20000, 90000]]))
>> >> > becomes much slower and gives an overflow warning with int32. int32
>> >> > is
>> >> > just
>> >> > not enough. This is just an implementation detail and does not in any
>> >> > way
>> >> > limit the accepted inputs, so I don't see a problem here.
>> >>
>> >> for large numbers like this the chisquare test should give almost the
>> >> same results, it looks pretty "asymptotic" to me. (the usual
>> >> recommendation for the chisquare is more than 5 expected observations
>> >> in each cell)
>> >> I think the precision is required for some edge cases when
>> >> probabilities get very small. The main failing case, I was fighting
>> >> with for several days last winter, and didn't manage to fix had a zero
>> >> at the first position. I didn't think about increasing the precision.
>> >>
>> >> >
>> >> > Don't know what the behavior should be if a user passes in floats
>> >> > though?
>> >> > Just convert to int like now, or raise a warning?
>> >>
>> >> I wouldn't do any type checking, and checking that floats are almost
>> >> integers doesn't sound really necessary either, unless or until users
>> >> complain. The standard usage should be pretty clear for contingency
>> >> tables with count data.
>> >>
>> >> Josef
>> >>
>> >
>> > Thanks for checking. https://github.com/rgommers/scipy/commit/b968ba17
>> > should fix remaining things. Will wait for a few days to see if we get a
>> > reference to the algorithm. Then will commit.
>>
>> Sorry but I don't agree. But I said I do not have time to address this
>> and I really do not like adding the code as it is.
>
> Bruce, I replied in detail to your previous email, so I'm not sure what you
> want me to do here. If you don't have time for more discussion, and Josef
> (as stats maintainer) is happy with the addition, I think it can go in.
> Actually, it did go in right before your email, but that doesn't mean it's
> too late for some changes.
>
> Cheers,
> Ralf
>

I know, and the reason for my negativity is that this commit goes
against what I had proposed: to provide single stats functions that
handle the various ndarray types, not just the 'standard' array. It
also lacks the flexibility to handle general R by C tables, which are
very common, although working out how to do those cases takes time.
The actual error is that there is no dimensionality check.
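
For illustration only, a minimal sketch of such an input check might
look like the following. The helper name check_2x2_table is
hypothetical and is not part of scipy; the sketch assumes that masked
entries should be rejected outright, since np.asarray() would
otherwise drop the mask silently.

```python
import numpy as np

def check_2x2_table(table):
    """Hypothetical helper: validate input for a 2x2 exact test."""
    # np.asarray() on a masked array returns the underlying data and
    # discards the mask, so look for masked entries first.
    if np.ma.is_masked(table):
        raise ValueError("masked values are not supported in a 2x2 table")
    # int64 follows the thread's note that int32 overflows for large counts.
    c = np.asarray(table, dtype=np.int64)
    if c.shape != (2, 2):
        raise ValueError("expected a 2x2 table, got shape %s" % (c.shape,))
    if np.any(c < 0):
        raise ValueError("all counts must be non-negative")
    return c
```

With this in place a 1-d input such as [1, 2, 3, 4] raises a
ValueError with an informative message instead of failing later.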

I find it shocking that a statistical test returns an 'odds ratio' that
has nothing to do with the actual test, nor with any of the other
related statistical tests like chisquare. If you accept that, then we
must immediately add that odds ratio to ALL statistical tests.
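
To show that the two quantities are computed independently, here is a
rough sketch, not the code in the commit. Both function names are
hypothetical, and the two-sided p-value uses the common "sum of all
tables no more probable than the observed one" definition, which is an
assumption about what is wanted. It uses exact integer arithmetic from
the standard library instead of the hypergeom function.

```python
from math import comb

def sample_odds_ratio(table):
    """Descriptive statistic only; it is not used by the test below."""
    (a, b), (c, d) = table
    # Assumption: report infinity when the denominator is zero.
    return float("inf") if b * c == 0 else (a * d) / (b * c)

def fisher_two_sided_p(table):
    """Two-sided p-value for a 2x2 table of non-negative integer counts."""
    (a, b), (c, d) = table
    row1, col1, total = a + b, a + c, a + b + c + d
    # Possible values of the top-left cell given the fixed margins.
    lo = max(0, col1 - (total - row1))
    hi = min(row1, col1)
    # Unnormalised hypergeometric weights, kept as exact integers.
    weights = {k: comb(row1, k) * comb(total - row1, col1 - k)
               for k in range(lo, hi + 1)}
    observed = weights[a]
    # Sum the tables that are no more probable than the observed one.
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / comb(total, col1)
```

For the table [[8, 2], [1, 5]] the odds ratio is 20 and the p-value is
400/11440, roughly 0.035; neither function calls the other.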

Bruce

