[SciPy-User] bug in rankdata?
Warren Weckesser
warren.weckesser@gmail....
Sat Feb 16 12:22:57 CST 2013
On 2/15/13, Chris Rodgers <xrodgers@gmail.com> wrote:
> Thanks very much! I discovered this bug because mann-whitney U was
> giving me bizarre results, like a negative U statistic. My data is a
> large number of integer counts, mostly zeros, which is the worst case
> for ties.
>
> Until I can update scipy, I'll either write my own rankdata method,
> which will be very slow, or I'll use the R equivalent which is more
> feature-ful (but then I have to figure out rpy2 which will also be
> slow).
You could also try pandas (http://pandas.pydata.org/). The DataFrame
and Series classes have a 'rank' method
(http://pandas.pydata.org/pandas-docs/stable/computation.html#data-ranking).
Warren
>
> On Fri, Feb 15, 2013 at 10:22 AM, Warren Weckesser
> <warren.weckesser@gmail.com> wrote:
>>
>>
>> On Fri, Feb 15, 2013 at 10:32 AM, Warren Weckesser
>> <warren.weckesser@gmail.com> wrote:
>>>
>>> On 2/14/13, Chris Rodgers <xrodgers@gmail.com> wrote:
>>> > The results I'm getting from rankdata seem completely wrong for large
>>> > datasets. I'll illustrate with a case where all data are equal, so
>>> > every rank should be len(data) / 2 + 0.5.
>>> >
>>> > In [220]: rankdata(np.ones((10000,), dtype=np.int))
>>> > Out[220]: array([ 5000.5, 5000.5, 5000.5, ..., 5000.5, 5000.5,
>>> > 5000.5])
>>> >
>>> > In [221]: rankdata(np.ones((100000,), dtype=np.int))
>>> > Out[221]:
>>> > array([ 7050.82704, 7050.82704, 7050.82704, ..., 7050.82704,
>>> > 7050.82704, 7050.82704])
>>> >
>>> > In [222]: rankdata(np.ones((1000000,), dtype=np.int))
>>> > Out[222]:
>>> > array([ 1784.293664, 1784.293664, 1784.293664, ..., 1784.293664,
>>> > 1784.293664, 1784.293664])
>>> >
>>> > In [223]: scipy.__version__
>>> > Out[223]: '0.11.0'
>>> >
>>> > In [224]: numpy.__version__
>>> > Out[224]: '1.6.1'
>>> >
>>> >
>>> > The results are completely off for N>10000 or so. Am I doing something
>>> > wrong?
>>>
>>>
>>> Looks like a bug. The code that accumulates the ranks of the tied
>>> values is using a 32 bit integer for the sum of the ranks, and this is
>>> overflowing. I'll see if I can get this fixed for the imminent
>>> release of 0.12.
>>>
>>> Warren
>>>
>>
>>
>> A pull request with the fix is here:
>> https://github.com/scipy/scipy/pull/436
>>
>>
>> Warren
>>
>>
>>>
>>> > _______________________________________________
>>> > SciPy-User mailing list
>>> > SciPy-User@scipy.org
>>> > http://mail.scipy.org/mailman/listinfo/scipy-user
>>> >
>>
>>
>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
More information about the SciPy-User
mailing list