[Numpy-discussion] Bug in numpy.mean() revisited

Tom Aldcroft aldcroft@head.cfa.harvard....
Thu Jul 26 22:26:56 CDT 2012

There was a thread in January discussing the non-obvious behavior of
numpy.mean() for large arrays of float32 values [1].  This issue is
nicely discussed at the end of the numpy.mean() documentation [2] with
an example:

>>> a = np.zeros((2, 512*512), dtype=np.float32)
>>> a[0, :] = 1.0
>>> a[1, :] = 0.1
>>> np.mean(a)

>From the docs and previous discussion it seems there is no technical
difficulty in choosing a different (higher precision) type for the
accumulator using the dtype arg, and in fact this is done
automatically for int values.

My question is whether there would be any support for doing something
more than documenting this behavior.  I suspect very few people ever
make it below the fold for the np.mean() documentation.  Taking the
mean of large arrays of float32 values is a *very* common use case and
giving the wrong answer with default inputs is really disturbing.  I
recently had to rebuild a complex science data archive because of
corrupted mean values.

Possible ideas to stimulate discussion:
1. Always use float64 to accumulate float types that are 64 bits or
less.   Are there serious performance impacts to automatically using
float64 to accumulate float32 arrays?  I appreciate this would likely
introduce unwanted regressions (sometimes suddenly getting the right
answer is a bad thing).  So could this be considered for numpy 2.0?

2. Might there be a way to emit a warning if the number of values and
the max accumulated value [3] are such that the estimated fractional
error is above some tolerance?  I'm not even sure if this is a good
idea or if there will be howls from the community as their codes start
warning about inaccurate mean values.  Better idea along this line??


[1]: http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059960.html
[2]: http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
[3]: Using the max accumulated value during accumulation instead of
the final accumulated value seems like the right thing for estimating
precision loss.  But this would affect performance so maybe just using
the final value would catch many cases.

More information about the NumPy-Discussion mailing list