[Numpy-discussion] please change mean to use dtype=float

Tim Hochberg tim.hochberg at ieee.org
Wed Sep 20 11:08:15 CDT 2006

Robert Kern wrote:
> Sebastian Haase wrote:
>> Robert Kern wrote:
>>> Sebastian Haase wrote:
>>>> I know that having too much knowledge of the details often makes one 
>>>> forget what the "newcomers" will do and expect.
>>> Please be more careful with such accusations. Repeated frequently, they can 
>>> become quite insulting.
>> I did not mean to insult anyone - what I meant was, that I'm for numpy 
>> becoming an easy platform to use. I have spend and enjoyed part the last 
>> four years developing and evangelizing Python as an alternative to 
>> Matlab and C/Fortran based image analysis environment. I often find 
>> myself arguing for good support of the single precision data format. So 
>> I find it actually somewhat ironic to see myself arguing now for wanting 
>> float64 over float32 ;-)
> No one is doubting that you want numpy to be easy to use. Please don't doubt 
> that the rest of us want otherwise. However, the fact that you *want* numpy to 
> be easy to use does not mean that your suggestions *will* make numpy easy to use.
> We haven't forgotten what newcomers will do; to the contrary, we are quite aware 
> that new users need consistent behavior in order to learn how to use a system. 
> Adding another special case in how dtypes implicitly convert to one another will 
> impede new users being able to understand the whole system. See A. M. 
> Archibald's question in the thread "ufunc.reduce and conversion" for an example. 
> In our judgement this is a worse outcome than notational convenience for float32 
> users, who already need to be aware of the effects of their precision choice. 
> Each of us can come to different conclusions in good faith without one of us 
> forgetting the new user experience.
> Let me offer a third path: the algorithms used for .mean() and .var() are 
> substandard. There are much better incremental algorithms that entirely avoid 
> the need to accumulate such large (and therefore precision-losing) intermediate 
> values. The algorithms look like the following for 1D arrays in Python:
> def mean(a):
>      m = a[0]
>      for i in range(1, len(a)):
>          m += (a[i] - m) / (i + 1)
>      return m
> def var(a):
>      m = a[0]
>      t = a.dtype.type(0)
>      for i in range(1, len(a)):
>          q = a[i] - m
>          r = q / (i+1)
>          m += r
>          t += i * q * r
>      t /= len(a)
>      return t
> Alternatively, from Knuth:
> def var_knuth(a):
>      m = a.dtype.type(0)
>      variance = a.dtype.type(0)
>      for i in range(len(a)):
>          delta = a[i] - m
>          m += delta / (i+1)
>          variance += delta * (a[i] - m)
>      variance /= len(a)
>      return variance
> If you will code up implementations of these for ndarray.mean() and 
> ndarray.var(), I will check them in and then float32 arrays will avoid most of 
> the catastrophes that the current implementations run into.

>>>> We are only talking 
>>>> about people that will a) work with single-precision data (e.g. large 
>>>> scale-image analysis) and who b) will tend to "just use the default" 
>>>> (dtype)  --- How else can I say this: these people will just assume that 
>>>> arr.mean() *is* the mean of arr.
>>> I don't understand what you mean, here. arr.mean() is almost never *the* mean of 
>>> arr. Double precision can't fix that.
>> This was not supposed to be a scientific statement -- I'm (again) 
>> thinking of our students that not always appreciate the full complexity 
>> of computational numerics and data types and such.
> They need to appreciate the complexity of computational numerics if they are 
> going to do numerical computation. Double precision does not make it any simpler.

More information about the Numpy-discussion mailing list