[Scipy-tickets] [SciPy] #1805: rankdata returns wrong results on masked arrays

SciPy Trac scipy-tickets@scipy....
Thu Jan 3 09:53:03 CST 2013


#1805: rankdata returns wrong results on masked arrays
-------------------------+--------------------------------------------------
 Reporter:  hberger      |       Owner:  rgommers   
     Type:  defect       |      Status:  new        
 Priority:  normal       |   Milestone:  Unscheduled
Component:  scipy.stats  |     Version:  0.11.0     
 Keywords:               |  
-------------------------+--------------------------------------------------
 To reproduce this bug:

 # Win7, x-64, Python 2.7.3
 # numpy-1.6.2
 # scipy-0.11.0
 ###  ---- incorrect ranking with masked values---

 import numpy as np

 a = ['NaN','NaN','NaN','NaN','NaN','NaN','NaN',2.39, -1e-10, 2.4e+30,
 0.42, 'NaN', 2.4]

 aa = np.array(a,'float')

 aa

 -> array([             nan,              nan,              nan,
                     nan,              nan,              nan,
                     nan,   2.39000000e+00,  -1.00000000e-10,
          2.40000000e+30,   4.20000000e-01,              nan,
          2.40000000e+00])

 from scipy import stats

 stats.mstats_basic.rankdata(aa, use_missing=False)

 -> array([  6.,   7.,   8.,   9.,  10.,  11.,  12.,   3.,   1.,   5.,
           2.,  13.,   4.])

 # Problem 1 - the nan values receive ranks 6.-13., above every valid value
 # (the highest valid value, 2.4e+30, gets rank 5). Missing values are not
 # reported as rank 0.
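
 A plain ndarray carries no mask, so every NaN counts as a valid entry,
 which is consistent with the NaNs receiving ordinary ranks in the output
 above. The following is a minimal sketch, assuming only NumPy, of how the
 number of valid entries changes once the NaNs are masked with
 np.ma.masked_invalid. The variable names are illustrative and not taken
 from the SciPy source.

```python
import numpy as np

aa = np.array([np.nan] * 7 + [2.39, -1e-10, 2.4e30, 0.42, np.nan, 2.4])

# A plain ndarray has no mask, so all 13 elements count as valid.
plain = np.ma.asarray(aa)
n_valid_plain = int(plain.count())

# masked_invalid masks NaN (and inf) entries, leaving 5 valid values.
masked = np.ma.masked_invalid(aa)
n_valid_masked = int(masked.count())

print(n_valid_plain, n_valid_masked)
```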

 aam=np.ma.masked_array(aa,mask=np.isnan(aa))

 aam

 -> masked_array(data = [-- -- -- -- -- -- -- 2.39 -1e-10 2.4e+30 0.42 -- 2.4],
              mask = [ True  True  True  True  True  True  True False False False False  True False],
        fill_value = 1e+20)

 stats.mstats_basic.rankdata(aam, use_missing=False)

 -> array([ 5.,  0.,  0.,  0.,  0.,  0.,  0.,  3.,  1.,  0.,  2.,  0.,  4.])

 # Problem 2 - the masked values are reported as 0 except the first one
 # (position 0 gets rank 5), while the highest valid value (2.4e+30 at
 # position 9) is also reported as 0

 ###  ---- correct ranking ---

 b = np.array([2.39, -1e-10, 2.4e+30, 0.42, 2.4],"float")

 b

 -> array([  2.39000000e+00,  -1.00000000e-10,   2.40000000e+30,
          4.20000000e-01,   2.40000000e+00])

 stats.mstats_basic.rankdata(b, use_missing=False)

 -> array([ 3.,  1.,  5.,  2.,  4.])
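
 For the masked array, the expected result would presumably carry these
 same ranks at the unmasked positions and 0 at the masked ones. A minimal
 sketch of that expectation, assuming only NumPy, is given below. The
 helper rank_valid_only is hypothetical (it is not part of scipy.stats)
 and, unlike a full rankdata, it does not average the ranks of ties.

```python
import numpy as np

aa = np.array([np.nan] * 7 + [2.39, -1e-10, 2.4e30, 0.42, np.nan, 2.4])
aam = np.ma.masked_array(aa, mask=np.isnan(aa))


def rank_valid_only(marr):
    """Hypothetical helper: rank unmasked values 1..n, masked values get 0.

    Works on 1-d masked arrays; ties are NOT averaged.
    """
    mask = np.ma.getmaskarray(marr)
    ranks = np.zeros(marr.size, dtype=float)
    valid = marr.compressed()            # unmasked values only
    order = valid.argsort()
    valid_ranks = np.empty(valid.size, dtype=float)
    valid_ranks[order] = np.arange(1, valid.size + 1)
    ranks[~mask] = valid_ranks           # write ranks back to valid positions
    return ranks


expected = rank_valid_only(aam)
print(expected)
```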


 Possible explanation:
 in _rank1d:
  masked_array.argsort(axis=None, kind='quicksort', order=None, fill_value=None)
    Return an ndarray of indices that sort the array along the specified axis.
    *Masked values are filled beforehand to fill_value.*

 aam.argsort()

 -> array([ 8, 10,  7, 12,  0,  1,  2,  3,  4,  5,  6, 11,  9], dtype=int64)

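 # The fill_value here is 1e+20, which is smaller than the valid value
 # 2.4e+30, so the masked entries sort ahead of position 9.
 # _rank1d will then treat the first 5 entries of the argsort result as
 # "valid" and everything from index 5 onward as missing.
 # This gives position 0 of the original data a valid rank, while
 # position 9 is treated as missing.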
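
 The explanation above can be reproduced without SciPy. The sketch below
 assumes that the rank assignment works as described (the first n entries
 of the argsort result get ranks 1..n and the rest get 0); it is not the
 actual _rank1d source. It fills the masked positions with 1e+20 by hand
 and uses a stable mergesort so that the tied fill values keep their index
 order, whereas the session above used the default quicksort.

```python
import numpy as np

aa = np.array([np.nan] * 7 + [2.39, -1e-10, 2.4e30, 0.42, np.nan, 2.4])
mask = np.isnan(aa)
n = int((~mask).sum())                   # number of valid values (5)

# Emulate "masked values are filled beforehand to fill_value" with 1e+20.
filled = np.where(mask, 1e20, aa)

# Stable sort (assumption): tied fill values stay in index order.
idx = filled.argsort(kind="mergesort")

# Assumed rank assignment: first n sorted positions get 1..n, the rest 0.
rk = np.zeros(aa.size)
rk[idx[:n]] = np.arange(1, n + 1)

print(idx)
print(rk)
```

 Because 2.4e+30 is larger than the fill value, it sorts last, and one
 masked position ends up among the first n entries instead.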

-- 
Ticket URL: <http://projects.scipy.org/scipy/ticket/1805>
SciPy <http://www.scipy.org>
SciPy is open-source software for mathematics, science, and engineering.

