[Numpy-discussion] please change mean to use dtype=float
David M. Cooke
cookedm at physics.mcmaster.ca
Thu Sep 21 16:59:37 CDT 2006
On Thu, 21 Sep 2006 11:34:42 -0700
Tim Hochberg <tim.hochberg at ieee.org> wrote:
> Tim Hochberg wrote:
> > Robert Kern wrote:
> >
> >> David M. Cooke wrote:
> >>
> >>
> >>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
> >>>
> >>>
> >>>> Let me offer a third path: the algorithms used for .mean() and .var()
> >>>> are substandard. There are much better incremental algorithms that
> >>>> entirely avoid the need to accumulate such large (and therefore
> >>>> precision-losing) intermediate values. The algorithms look like the
> >>>> following for 1D arrays in Python:
> >>>>
> >>>> def mean(a):
> >>>> m = a[0]
> >>>> for i in range(1, len(a)):
> >>>> m += (a[i] - m) / (i + 1)
> >>>> return m
> >>>>
> >>>>
> >>> This isn't really going to be any better than using a simple sum.
> >>> It'll also be slower (a division per iteration).
> >>>
> >>>
> >> With one exception, every test that I've thrown at it shows that it's
> >> better for float32. That exception is uniformly spaced arrays, like
> >> linspace().
> >>
> >> > You do avoid
> >> > accumulating large sums, but then doing the division a[i]/len(a) and
> >> > adding that will do the same.
> >>
> >> Okay, this is true.
> >>
> >>
> >>
> >>> Now, if you want to avoid losing precision, you want to use a better
> >>> summation technique, like compensated (or Kahan) summation:
> >>>
> >>> def mean(a):
> >>> s = e = a.dtype.type(0)
> >>> for i in range(0, len(a)):
> >>> temp = s
> >>> y = a[i] + e
> >>> s = temp + y
> >>> e = (temp - s) + y
> >>> return s / len(a)
> >>
> >>>> def var(a):
> >>>> m = a[0]
> >>>> t = a.dtype.type(0)
> >>>> for i in range(1, len(a)):
> >>>> q = a[i] - m
> >>>> r = q / (i+1)
> >>>> m += r
> >>>> t += i * q * r
> >>>> t /= len(a)
> >>>> return t
> >>>>
> >>>> Alternatively, from Knuth:
> >>>>
> >>>> def var_knuth(a):
> >>>> m = a.dtype.type(0)
> >>>> variance = a.dtype.type(0)
> >>>> for i in range(len(a)):
> >>>> delta = a[i] - m
> >>>> m += delta / (i+1)
> >>>> variance += delta * (a[i] - m)
> >>>> variance /= len(a)
> >>>> return variance
>
> I'm going to go ahead and attach a module containing the versions of
> mean, var, etc that I've been playing with in case someone wants to mess
> with them. Some were stolen from traffic on this list, for others I
> grabbed the algorithms from wikipedia or equivalent.
I looked into this a bit more. I checked float32 (single precision) and
float64 (double precision), using long doubles (float96) for the "exact"
results. This is based on your code. Results are compared using
abs(exact_stat - computed_stat) / max(abs(values)), with 10000 values in the
range of [-100, 900]
First, the mean. In float32, the Kahan summation in single precision is
better by about 2 orders of magnitude than simple summation. However,
accumulating the sum in double precision is better by about 9 orders of
magnitude than simple summation (7 orders more than Kahan).
In float64, Kahan summation is the way to go, by 2 orders of magnitude.
For the variance, in float32, Knuth's method is *no better* than the two-pass
method. Tim's code does an implicit conversion of intermediate results to
float64, which is why he saw a much better result. The two-pass method using
Kahan summation (again, in single precision), is better by about 2 orders of
magnitude. There is practically no difference when using a double-precision
accumulator amongst the techniques: they're all about 9 orders of magnitude
better than single-precision two-pass.
In float64, Kahan summation is again better than the rest, by about 2 orders
of magnitude.
I've put my adaptation of Tim's code, and box-and-whisker plots of the
results, at http://arbutus.mcmaster.ca/dmc/numpy/variance/
Conclusions:
- If you're going to calculate everything in single precision, use Kahan
summation. Using it in double-precision also helps.
- If you can use a double-precision accumulator, it's much better than any of
the techniques in single-precision only.
- for speed+precision in the variance, either use Kahan summation in single
precision with the two-pass method, or use double precision with simple
summation with the two-pass method. Knuth buys you nothing, except slower
code :-)
After 1.0 is out, we should look at doing one of the above.
--
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke http://arbutus.physics.mcmaster.ca/dmc/
|cookedm at physics.mcmaster.ca
More information about the Numpy-discussion
mailing list