[Numpy-discussion] please change mean to use dtype=float
Tim Hochberg
tim.hochberg at ieee.org
Wed Sep 20 14:02:38 CDT 2006
[Sorry, this version should have less munged formatting since I clipped
the comments. Oh, and the Kahan sum algorithm was grabbed from
wikipedia, not mathworld]
Tim Hochberg wrote:
> Robert Kern wrote:
>
>> David M. Cooke wrote:
>>
>>
>>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
>>>
>>>
>>>> Let me offer a third path: the algorithms used for .mean() and .var() are
>>>> substandard. There are much better incremental algorithms that entirely avoid
>>>> the need to accumulate such large (and therefore precision-losing) intermediate
>>>> values. The algorithms look like the following for 1D arrays in Python:
>>>>
>>>> def mean(a):
>>>>     m = a[0]
>>>>     for i in range(1, len(a)):
>>>>         m += (a[i] - m) / (i + 1)
>>>>     return m
>>>>
>>>>
>>> This isn't really going to be any better than using a simple sum.
>>> It'll also be slower (a division per iteration).
>>>
>>>
>> With one exception, every test that I've thrown at it shows that it's better for
>> float32. That exception is uniformly spaced arrays, like linspace().
>>
>>
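[The float32 effect Robert describes is easy to reproduce with a current NumPy. A minimal sketch of my own, not Robert's test script; constant data is used so the result is deterministic, and np.add.accumulate stands in for a naive left-to-right accumulation loop:]

```python
import numpy as np

n = 10**7
data = np.full(n, 0.1, dtype=np.float32)

# np.add.accumulate sums strictly left to right in float32, mimicking the
# naive accumulation loop inside mean().  (data.sum() in current NumPy uses
# pairwise summation, which hides most of the effect.)
seq_sum = float(np.add.accumulate(data)[-1])
naive_mean = seq_sum / n

# Every element is exactly float32(0.1), so this is the exact mean.
true_mean = float(np.float32(0.1))

# Once the running total is large, each added 0.1 loses low-order bits,
# so naive_mean drifts far from true_mean.
print(abs(naive_mean - true_mean))
```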
> Here's a version of mean based on the Kahan sum, including the
> compensation term, that seems to be consistently much better than the
> builtin mean. It is typically 4 orders of magnitude closer to the
> "correct" answer than the builtin mean and 2 orders of magnitude better
> than just naively using the Kahan sum. I'm using the mean computed with
> dtype=float64 as the "correct" value.
>
>
>
def rawKahanSum(values):
    # Kahan (compensated) summation: returns both the running total and
    # the final compensation term c.
    if not len(values):
        return 0.0, 0.0
    total = values[0]
    c = type(total)(0.0)
    for x in values[1:]:
        y = x - c            # apply the compensation from the last step
        t = total + y        # low-order bits of y may be lost here
        c = (t - total) - y  # recover what was lost
        total = t
    return total, c
def kahanSum(values):
    total, c = rawKahanSum(values)
    return total

def mean(values):
    total, c = rawKahanSum(values)
    n = float(len(values))
    # fold the leftover compensation term into the result
    return total / n - c / n
for i in range(100):
    data = random.random([10000]).astype(float32)
    baseline = data.mean(dtype=float64)
    delta_builtin_mean = baseline - data.mean()
    delta_compensated_mean = baseline - mean(data)
    delta_kahan_mean = baseline - (kahanSum(data) / len(data))
    if not (abs(delta_builtin_mean) >= abs(delta_kahan_mean) >=
            abs(delta_compensated_mean)):
        print ">>>",
        print delta_builtin_mean, delta_kahan_mean, delta_compensated_mean
> -tim
>
>>> You do avoid
>>> accumulating large sums, but then doing the division a[i]/len(a) and
>>> adding that will do the same.
>>
>> Okay, this is true.
>>
>>
>>
>>> Now, if you want to avoid losing precision, you want to use a better
>>> summation technique, like compensated (or Kahan) summation:
>>>
>>> def mean(a):
>>>     s = e = a.dtype.type(0)
>>>     for i in range(0, len(a)):
>>>         temp = s
>>>         y = a[i] + e
>>>         s = temp + y
>>>         e = (temp - s) + y
>>>     return s / len(a)
>>>
>>> Some numerical experiments in Maple using 5-digit precision show that
>>> your mean is maybe a bit better in some cases, but can also be much
>>> worse, than sum(a)/len(a), but both are quite poor in comparison to the
>>> Kahan summation.
>>>
>>> (We could probably use a fast implementation of Kahan summation in
>>> addition to a.sum())
>>>
>>>
>> +1
>>
>>
>>
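[David's 5-digit experiments can be replicated without Maple using Python's decimal module. A sketch of my own, not his Maple worksheet; the constant data and precision are purely illustrative:]

```python
from decimal import Decimal, getcontext

getcontext().prec = 5  # 5 significant digits, as in the Maple experiments

def naive_mean(a):
    s = Decimal(0)
    for x in a:
        s += x  # each add rounds to 5 digits
    return s / len(a)

def kahan_mean(a):
    # Compensated summation, same recurrence as the mean() quoted above.
    s = e = Decimal(0)
    for x in a:
        temp = s
        y = x + e
        s = temp + y
        e = (temp - s) + y
    return s / len(a)

data = [Decimal("1.2345")] * 10000  # the exact mean is 1.2345
print(naive_mean(data), kahan_mean(data))
```

With only 5 digits the naive running sum stalls badly once it passes 10000, while the compensated version stays close to the exact answer.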
>>>> def var(a):
>>>>     m = a[0]
>>>>     t = a.dtype.type(0)
>>>>     for i in range(1, len(a)):
>>>>         q = a[i] - m
>>>>         r = q / (i+1)
>>>>         m += r
>>>>         t += i * q * r
>>>>     t /= len(a)
>>>>     return t
>>>>
>>>> Alternatively, from Knuth:
>>>>
>>>> def var_knuth(a):
>>>>     m = a.dtype.type(0)
>>>>     variance = a.dtype.type(0)
>>>>     for i in range(len(a)):
>>>>         delta = a[i] - m
>>>>         m += delta / (i+1)
>>>>         variance += delta * (a[i] - m)
>>>>     variance /= len(a)
>>>>     return variance
>>>>
>>>>
>>> These formulas are good when you can only do one pass over the data
>>> (like in a calculator where you don't store all the data points), but
>>> are slightly worse than doing two passes. Kahan summation would probably
>>> also be good here too.
>>>
>>>
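[For concreteness, a small sketch of the comparison being argued here, with my own data and harness run under a current NumPy; the large offset is there to make float32 cancellation visible, and the float64 result serves as the reference:]

```python
import numpy as np

def var_knuth(a):
    # One-pass algorithm from Knuth, as quoted above, accumulating in the
    # array's own dtype (float32 here).
    m = a.dtype.type(0)
    variance = a.dtype.type(0)
    for i in range(len(a)):
        delta = a[i] - m
        m += delta / a.dtype.type(i + 1)
        variance += delta * (a[i] - m)
    return variance / a.dtype.type(len(a))

np.random.seed(1)
data = (np.random.random(10000) + 1000).astype(np.float32)

baseline = float(data.var(dtype=np.float64))          # float64 reference
twopass = float(((data - data.mean()) ** 2).mean())   # two-pass, float32
onepass = float(var_knuth(data))                      # one-pass, float32

print(abs(twopass - baseline), abs(onepass - baseline))
```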
>> Again, my tests show otherwise for float32. I'll condense my ipython log into a
>> module for everyone's perusal. It's possible that the Kahan summation of the
>> squared residuals will work better than the current two-pass algorithm and the
>> implementations I give above.
>>
>>
>>
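[One way to sketch that last idea, Kahan summation applied to the squared residuals of the two-pass algorithm; this is my own illustration, not code from the thread:]

```python
import numpy as np

def kahan_sum(values):
    # Compensated summation over a 1-D array, accumulating in the
    # array's own dtype.
    total = values.dtype.type(0)
    c = values.dtype.type(0)
    for x in values:
        y = x - c
        t = total + y
        c = (t - total) - y
        total = t
    return total

def var_two_pass_kahan(a):
    # Two-pass variance, but with the squared residuals summed by
    # kahan_sum instead of a plain a.sum().
    m = a.mean()
    resid2 = (a - m) ** 2
    return kahan_sum(resid2) / a.dtype.type(len(a))

np.random.seed(2)
data = (np.random.random(10000) + 100).astype(np.float32)
print(abs(float(var_two_pass_kahan(data)) - data.var(dtype=np.float64)))
```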
>
>
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/numpy-discussion
More information about the Numpy-discussion mailing list