[SciPy-User] Fisher exact test, anyone?
josef.pktd@gmai...
Sun Nov 21 14:18:16 CST 2010
On Sun, Nov 21, 2010 at 2:03 PM, Bruce Southey <bsouthey@gmail.com> wrote:
> On Sun, Nov 21, 2010 at 2:23 AM, Ralf Gommers
> <ralf.gommers@googlemail.com> wrote:
>>
>>
>> On Sat, Nov 20, 2010 at 1:35 AM, Bruce Southey <bsouthey@gmail.com> wrote:
>>>
>>> On Wed, Nov 17, 2010 at 7:24 AM, Ralf Gommers
>>> <ralf.gommers@googlemail.com> wrote:
>>> >
>>> >
>>> > On Wed, Nov 17, 2010 at 8:38 AM, <josef.pktd@gmail.com> wrote:
>>> >>
>>> >> On Tue, Nov 16, 2010 at 7:10 PM, Ralf Gommers
>>> >> <ralf.gommers@googlemail.com> wrote:
>>> >> >
>>> >> >
>>> >> > On Tue, Nov 16, 2010 at 11:45 PM, Bruce Southey <bsouthey@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> I have no problem including this if we can agree on the API,
>>> >> >> because everything else is internal and can be fixed by the
>>> >> >> release date. So I would accept a placeholder API that enables a
>>> >> >> user in the future to select which tail(s) is used.
>>> >> >
>>> >> > It is always possible to add a keyword "tail" later that defaults to
>>> >> > 2-tailed. As long as the behavior doesn't change this is perfectly
>>> >> > fine,
>>> >> > and
>>> >> > better than having a placeholder.
>>> >> >>
>>> >> >> 1) It just cannot use np.asarray() without checking the input
>>> >> >> first. This is particularly bad for masked arrays.
>>> >> >>
>>> >> > Don't understand this. The input array is not returned, only used
>>> >> > internally. And I can't think of doing anything reasonable with a
>>> >> > 2x2 table with masked values. If that's possible at all, it should
>>> >> > probably just go into mstats.
>>> >> >
>>> >> >>
>>> >> >> 2) There is no dimension checking because, as I understand it,
>>> >> >> this can only handle a '2 by 2' table. I do not know enough for
>>> >> >> general 'r by c' tables or the 1-d case either.
>>> >> >>
>>> >> > Don't know how easy it would be to add larger tables. I can add
>>> >> > dimension
>>> >> > checking with an informative error message.
>>> >>
>>> >> There is some discussion in the ticket about more than 2x2;
>>> >> additions would be nice (and there are some examples on the Matlab
>>> >> File Exchange), but 2x2 is the most common case and has an
>>> >> unambiguous definition.
>>> >>
>>> >>
>>> >> >
>>> >> >>
>>> >> >> 3) The odds ratio should be removed because it is not part of
>>> >> >> the test. It is actually more general than this test.
>>> >> >>
>>> >> > Don't feel strongly about this either way. It comes almost for
>>> >> > free, and R seems to do the same.
>>> >>
>>> >> same here, it's kind of traditional to return two things, but in
>>> >> this case the odds ratio is not the test statistic; I don't see
>>> >> that it hurts either way
>>> >>
>>> >> >
>>> >> >> 4) Variable names such as min and max should not shadow Python
>>> >> >> functions.
>>> >> >
>>> >> > Yes, Josef noted this already, will change.
>>> >> >>
>>> >> >> 5) Is there a reference to the algorithm implemented? For example,
>>> >> >> SPSS
>>> >> >> provides a simple 2 by 2 algorithm:
>>> >> >>
>>> >> >> http://support.spss.com/ProductsExt/SPSS/Documentation/Statistics/algorithms/14.0/app05_sig_fisher_exact_test.pdf
>>> >> >
>>> >> > Not supplied, will ask on the ticket and include it.
>>> >>
>>> >> I thought I saw it somewhere but can't find the reference anymore;
>>> >> it was some kind of bisection algorithm. Having a reference would
>>> >> be good. Whatever the algorithm is, it's fast, even for larger
>>> >> values.
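Whatever the fast algorithm turns out to be, the two-sided 2x2 p-value it must reproduce has a direct definition in terms of the hypergeometric distribution. A minimal brute-force sketch, assuming `scipy.stats.hypergeom` (useful only as a cross-check, not as an implementation):

```python
import numpy as np
from scipy.stats import hypergeom

def fisher_exact_2x2_bruteforce(table):
    """Two-sided Fisher exact p-value for a 2x2 table of counts,
    summed directly from the hypergeometric pmf.  A slow reference
    implementation for checking a fast algorithm."""
    (a, b), (c, d) = np.asarray(table, dtype=np.int64)
    n = a + b + c + d            # grand total
    n1 = a + b                   # row 1 total
    k = a + c                    # column 1 total
    # all values the top-left cell can take with these fixed margins
    support = np.arange(max(0, k - (c + d)), min(k, n1) + 1)
    probs = hypergeom.pmf(support, n, n1, k)
    p_obs = hypergeom.pmf(a, n, n1, k)
    # two-sided p-value: total probability of every table that is
    # no more likely than the observed one (small tolerance for
    # floating-point ties)
    return probs[probs <= p_obs * (1 + 1e-12)].sum()
```

For the table [[8, 2], [1, 5]] this sums the tail tables with top-left cell 3, 8 and 9, giving 400/11440 ≈ 0.03497.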
>>> >>
>>> >> >>
>>> >> >> 6) Why exactly does the dtype need to be int64? That is, is
>>> >> >> there something wrong with the hypergeom function? I just want
>>> >> >> to understand why the precision change is required, because the
>>> >> >> input should enter with sufficient precision.
>>> >> >>
>>> >> > This test:
>>> >> > fisher_exact(np.array([[18000, 80000], [20000, 90000]]))
>>> >> > becomes much slower and gives an overflow warning with int32;
>>> >> > int32 is just not enough. This is just an implementation detail
>>> >> > and does not in any way limit the accepted inputs, so I don't see
>>> >> > a problem here.
>>> >>
>>> >> for large numbers like this the chisquare test should give almost
>>> >> the same results; it looks pretty "asymptotic" to me. (The usual
>>> >> recommendation for the chisquare is more than 5 expected
>>> >> observations in each cell.)
>>> >> I think the precision is required for some edge cases when
>>> >> probabilities get very small. The main failing case, which I fought
>>> >> with for several days last winter and didn't manage to fix, had a
>>> >> zero in the first position. I didn't think about increasing the
>>> >> precision.
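The asymptotic claim is easy to check with the functions that exist in today's scipy.stats (`fisher_exact` and `chi2_contingency` are assumed here); on the large table from above, the exact and chi-square p-values nearly coincide:

```python
import numpy as np
from scipy.stats import fisher_exact, chi2_contingency

table = np.array([[18000, 80000], [20000, 90000]])

# exact test (still fast even at these cell counts)
oddsratio, p_exact = fisher_exact(table)

# chi-square with Yates continuity correction; expected counts are
# in the thousands here, far beyond the "more than 5 per cell" rule
chi2, p_chi2, dof, expected = chi2_contingency(table)

# both p-values come out around 0.27-0.28, differing well below 0.01
print(p_exact, p_chi2)
```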
>>> >>
>>> >> >
>>> >> > Don't know what the behavior should be if a user passes in floats
>>> >> > though?
>>> >> > Just convert to int like now, or raise a warning?
>>> >>
>>> >> I wouldn't do any type checking, and checking that floats are almost
>>> >> integers doesn't sound really necessary either, unless or until users
>>> >> complain. The standard usage should be pretty clear for contingency
>>> >> tables with count data.
>>> >>
>>> >> Josef
>>> >>
>>> >
>>> > Thanks for checking. https://github.com/rgommers/scipy/commit/b968ba17
>>> > should fix remaining things. Will wait for a few days to see if we get a
>>> > reference to the algorithm. Then will commit.
>>>
>>> Sorry but I don't agree. But I said I do not have time to address this
>>> and I really do not like adding the code as it is.
>>
>> Bruce, I replied in detail to your previous email, so I'm not sure what
>> you want me to do here. If you don't have time for more discussion, and
>> Josef (as stats maintainer) is happy with the addition, I think it can
>> go in. Actually, it did go in right before your email, but that doesn't
>> mean it's too late for some changes.
>>
>> Cheers,
>> Ralf
>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
>>
>
> I know and the reason for my negativity is that this commit goes
> against what I had proposed: to provide single stats functions that
> handle the various ndarray types, not just the standard ndarray.
In this case, I see no reason to support any other array types. The
function asks for 4 numbers and returns 2 numbers; no nans or missing
values are allowed. (There is nothing to calculate if one of the 4
values is missing.) I haven't checked carefully this time, but I think
any array_like type that allows indexing should be handled correctly.
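As a quick check of that claim with today's `scipy.stats.fisher_exact` (assumed here, since that is where the function ended up), plain nested lists, tuples of tuples, and ndarrays are all accepted interchangeably, because the function converts with np.asarray internally:

```python
import numpy as np
from scipy.stats import fisher_exact

# the same 2x2 table in three array_like forms
p_list = fisher_exact([[8, 2], [1, 5]])[1]
p_tuple = fisher_exact(((8, 2), (1, 5)))[1]
p_arr = fisher_exact(np.array([[8, 2], [1, 5]]))[1]
# all three give the identical p-value
```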
> Also it
> lacks the flexibility to handle general R by C cases which are very
> common. But that requires time to find how to do those cases. The
> error is that there is no dimensionality check .
Handling more than 2x2 would be a nice extension, but is no reason to
not commit the current version.
It might be good to raise a ValueError if the input is not 2x2. I need
to check whether, for example, a 2x3 input would silently produce an
unintended result.
>
> I find it shocking that a statistical test returns an 'odds ratio' that
> has nothing to do with the actual test nor with any of the other
> related statistical tests like chisquare. If you accept that then we
> must immediately add that odds ratio to ALL statistical tests.
R returns the conditional maximum likelihood estimate of the odds
ratio; we return the prior (sample) odds ratio. I don't care much
either way, but the test is (implicitly) a test on the odds ratio.
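For concreteness, the prior (sample) odds ratio is just the cross-product ratio of the table, which is also what scipy's fisher_exact reports as its first return value (assumed from current scipy.stats behaviour):

```python
from scipy.stats import fisher_exact

def sample_odds_ratio(table):
    """Prior (unconditional sample) odds ratio of a 2x2 table
    [[a, b], [c, d]]: (a*d) / (b*c).  R's fisher.test instead
    reports a conditional MLE, so the two generally differ."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

print(sample_odds_ratio([[8, 2], [1, 5]]))   # 8*5 / (2*1) = 20.0
print(fisher_exact([[8, 2], [1, 5]])[0])     # same value from scipy
```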
Other tests provide their own statistic, which is not always the
actual statistic used in the calculation of the p-value. If it were
not for backward compatibility, I would add lots of things to ALL
statistical tests, as you can see in some tests in statsmodels.
And did I mention: It's really fast.
Josef
>
> Bruce
>