[Numpy-discussion] please change mean to use dtype=float

David M. Cooke cookedm at physics.mcmaster.ca
Wed Sep 20 11:34:17 CDT 2006


On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
> Let me offer a third path: the algorithms used for .mean() and .var() are 
> substandard. There are much better incremental algorithms that entirely avoid 
> the need to accumulate such large (and therefore precision-losing) intermediate 
> values. The algorithms look like the following for 1D arrays in Python:
> 
> def mean(a):
>      m = a[0]
>      for i in range(1, len(a)):
>          m += (a[i] - m) / (i + 1)
>      return m

This isn't really going to be any better than using a simple sum.
It'll also be slower (a division per iteration). You do avoid
accumulating large sums, but then doing the division a[i]/len(a) and
adding that will do the same.

Now, if you want to avoid losing precision, you want to use a better
summation technique, like compensated (or Kahan) summation:

def mean(a):
    s = e = a.dtype.type(0)
    for i in range(0, len(a)):
        temp = s
        y = a[i] + e
        s = temp + y
        e = (temp - s) + y
    return s / len(a)

Some numerical experiments in Maple using 5-digit precision show that
your mean is maybe a bit better in some cases, but can also be much
worse, than sum(a)/len(a), but both are quite poor in comparision to the
Kahan summation.

(We could probably use a fast implementation of Kahan summation in
addition to a.sum())

> def var(a):
>      m = a[0]
>      t = a.dtype.type(0)
>      for i in range(1, len(a)):
>          q = a[i] - m
>          r = q / (i+1)
>          m += r
>          t += i * q * r
>      t /= len(a)
>      return t
> 
> Alternatively, from Knuth:
> 
> def var_knuth(a):
>      m = a.dtype.type(0)
>      variance = a.dtype.type(0)
>      for i in range(len(a)):
>          delta = a[i] - m
>          m += delta / (i+1)
>          variance += delta * (a[i] - m)
>      variance /= len(a)
>      return variance

These formulas are good when you can only do one pass over the data
(like in a calculator where you don't store all the data points), but
are slightly worse than doing two passes. Kahan summation would probably
also be good here too.

-- 
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke                      http://arbutus.physics.mcmaster.ca/dmc/
|cookedm at physics.mcmaster.ca




More information about the Numpy-discussion mailing list