[Numpy-discussion] please change mean to use dtype=float

Tim Hochberg tim.hochberg at ieee.org
Wed Sep 20 13:53:41 CDT 2006


Robert Kern wrote:
> David M. Cooke wrote:
>   
>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
>>     
>>> Let me offer a third path: the algorithms used for .mean() and .var() are 
>>> substandard. There are much better incremental algorithms that entirely avoid 
>>> the need to accumulate such large (and therefore precision-losing) intermediate 
>>> values. The algorithms look like the following for 1D arrays in Python:
>>>
>>> def mean(a):
>>>      m = a[0]
>>>      for i in range(1, len(a)):
>>>          m += (a[i] - m) / (i + 1)
>>>      return m
>>>       
>> This isn't really going to be any better than using a simple sum.
>> It'll also be slower (a division per iteration).
>>     
>
> With one exception, every test that I've thrown at it shows that it's better for 
> float32. That exception is uniformly spaced arrays, like linspace().
>   
Here's version of mean based on the Kahan sum, including the 
compensation term that seems to be consistently much better than builtin 
mean  It seems to be typically 4 orders or magnitude closer to the 
"correct" answer than the builtin mean and 2 orders of magnitude better 
than just naively using the Kahan sum. I'm using the sum performed with 
dtype=int64 as the "correct" value.


def rawKahanSum(values):  # Stolen from mathworld
    if not input:
        return 0.0
    total = values[0]
    c = type(total)(0.0)            # A running compensation for lost 
low-order bits.
    for x in values[1:]:
        y = x - c                 # So far, so good: c is zero.
        t = total + y               # Alas, total is big, y small, so 
low-order digits of y are lost.
        c = (t - total) - y        # (t - total) recovers the high-order 
part of y; subtracting y recovers -(low part of y)
        total = t                   # Algebriacally, c should always be 
zero. Beware eagerly optimising compilers!
                                  # Next time around, the lost low part 
will be added to y in a fresh attempt.
    return total, c
   
def kahanSum(values):
    total, c =  rawKahanSum(values)
    return total
   
def mean(values):
    total, c = rawKahanSum(values)
    n = float(len(values))
    return total /  n - c / n
   


for i in range(100):
    data = random.random([10000]).astype(float32)
    baseline = data.mean(dtype=float64) 
    delta_builtin_mean = baseline - data.mean()
    delta_compensated_mean = baseline - mean(data)
    delta_kahan_mean = baseline - (kahanSum(data) / len(data))
    if not abs(delta_builtin_mean) >= abs(delta_kahan_mean) >= 
abs(delta_compensated_mean):
        print ">>>",
    print delta_builtin_mean, delta_kahan_mean, delta_compensated_mean

-tim
>  > You do avoid
>  > accumulating large sums, but then doing the division a[i]/len(a) and
>  > adding that will do the same.
>
> Okay, this is true.
>
>   
>> Now, if you want to avoid losing precision, you want to use a better
>> summation technique, like compensated (or Kahan) summation:
>>
>> def mean(a):
>>     s = e = a.dtype.type(0)
>>     for i in range(0, len(a)):
>>         temp = s
>>         y = a[i] + e
>>         s = temp + y
>>         e = (temp - s) + y
>>     return s / len(a)
>>
>> Some numerical experiments in Maple using 5-digit precision show that
>> your mean is maybe a bit better in some cases, but can also be much
>> worse, than sum(a)/len(a), but both are quite poor in comparision to the
>> Kahan summation.
>>
>> (We could probably use a fast implementation of Kahan summation in
>> addition to a.sum())
>>     
>
> +1
>
>   
>>> def var(a):
>>>      m = a[0]
>>>      t = a.dtype.type(0)
>>>      for i in range(1, len(a)):
>>>          q = a[i] - m
>>>          r = q / (i+1)
>>>          m += r
>>>          t += i * q * r
>>>      t /= len(a)
>>>      return t
>>>
>>> Alternatively, from Knuth:
>>>
>>> def var_knuth(a):
>>>      m = a.dtype.type(0)
>>>      variance = a.dtype.type(0)
>>>      for i in range(len(a)):
>>>          delta = a[i] - m
>>>          m += delta / (i+1)
>>>          variance += delta * (a[i] - m)
>>>      variance /= len(a)
>>>      return variance
>>>       
>> These formulas are good when you can only do one pass over the data
>> (like in a calculator where you don't store all the data points), but
>> are slightly worse than doing two passes. Kahan summation would probably
>> also be good here too.
>>     
>
> Again, my tests show otherwise for float32. I'll condense my ipython log into a 
> module for everyone's perusal. It's possible that the Kahan summation of the 
> squared residuals will work better than the current two-pass algorithm and the 
> implementations I give above.
>
>   






More information about the Numpy-discussion mailing list