[Numpy-discussion] please change mean to use dtype=float
Tim Hochberg
tim.hochberg at ieee.org
Wed Sep 20 11:08:15 CDT 2006
Robert Kern wrote:
> Sebastian Haase wrote:
>
>> Robert Kern wrote:
>>
>>> Sebastian Haase wrote:
>>>
>>>> I know that having too much knowledge of the details often makes one
>>>> forget what the "newcomers" will do and expect.
>>>>
>>> Please be more careful with such accusations. Repeated frequently, they can
>>> become quite insulting.
>>>
>>>
>> I did not mean to insult anyone - what I meant was that I'm for numpy
>> becoming an easy platform to use. I have spent and enjoyed part of the
>> last four years developing and evangelizing Python as an alternative to
>> Matlab- and C/Fortran-based image analysis environments. I often find
>> myself arguing for good support of the single-precision data format. So
>> I find it actually somewhat ironic to see myself arguing now for wanting
>> float64 over float32 ;-)
>>
>
> No one is doubting that you want numpy to be easy to use. Please don't suggest
> that the rest of us want otherwise. However, the fact that you *want* numpy to
> be easy to use does not mean that your suggestions *will* make numpy easy to use.
>
> We haven't forgotten what newcomers will do; to the contrary, we are quite aware
> that new users need consistent behavior in order to learn how to use a system.
> Adding another special case in how dtypes implicitly convert to one another will
> impede new users being able to understand the whole system. See A. M.
> Archibald's question in the thread "ufunc.reduce and conversion" for an example.
> In our judgement this is a worse outcome than notational convenience for float32
> users, who already need to be aware of the effects of their precision choice.
> Each of us can come to different conclusions in good faith without one of us
> forgetting the new user experience.
>
> Let me offer a third path: the algorithms used for .mean() and .var() are
> substandard. There are much better incremental algorithms that entirely avoid
> the need to accumulate such large (and therefore precision-losing) intermediate
> values. The algorithms look like the following for 1D arrays in Python:
>
> def mean(a):
>     m = a[0]
>     for i in range(1, len(a)):
>         m += (a[i] - m) / (i + 1)
>     return m
>
> def var(a):
>     m = a[0]
>     t = a.dtype.type(0)
>     for i in range(1, len(a)):
>         q = a[i] - m
>         r = q / (i + 1)
>         m += r
>         t += i * q * r
>     t /= len(a)
>     return t
>
> Alternatively, from Knuth:
>
> def var_knuth(a):
>     m = a.dtype.type(0)
>     variance = a.dtype.type(0)
>     for i in range(len(a)):
>         delta = a[i] - m
>         m += delta / (i + 1)
>         variance += delta * (a[i] - m)
>     variance /= len(a)
>     return variance
>
> If you will code up implementations of these for ndarray.mean() and
> ndarray.var(), I will check them in and then float32 arrays will avoid most of
> the catastrophes that the current implementations run into.
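A minimal, runnable pure-Python sketch of the incremental approach quoted above (the function names and the sample data are illustrative, not from the thread). Because the running mean stays on the scale of the data, no large intermediate sum accumulates:

```python
# Incremental (Knuth/Welford-style) one-pass mean and variance, as in the
# algorithms quoted above, written for plain Python lists of floats.
def mean_incremental(a):
    m = a[0]
    for i in range(1, len(a)):
        m += (a[i] - m) / (i + 1)  # running mean stays near data scale
    return m

def var_knuth(a):
    m = 0.0
    variance = 0.0
    for i in range(len(a)):
        delta = a[i] - m
        m += delta / (i + 1)
        variance += delta * (a[i] - m)
    return variance / len(a)  # population variance

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(mean_incremental(data))  # ~5.0
print(var_knuth(data))         # ~4.0
```

The same update rules vectorize poorly element-by-element in Python, which is why the thread asks for them to be implemented inside ndarray.mean() and ndarray.var() in C.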
>
+1
>
>>>> We are only talking
>>>> about people who will a) work with single-precision data (e.g.
>>>> large-scale image analysis) and who b) will tend to "just use the
>>>> default" (dtype) --- How else can I say this: these people will just
>>>> assume that arr.mean() *is* the mean of arr.
>>>>
>>> I don't understand what you mean, here. arr.mean() is almost never *the* mean of
>>> arr. Double precision can't fix that.
>>>
>>>
>> This was not supposed to be a scientific statement -- I'm (again)
>> thinking of our students, who do not always appreciate the full
>> complexity of computational numerics and data types and such.
>>
>
> They need to appreciate the complexity of computational numerics if they are
> going to do numerical computation. Double precision does not make it any simpler.
>
>
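A small plain-Python sketch of the point that arr.mean() is almost never *the* mean regardless of precision (the example values are mine, not from the thread): 0.1 has no exact binary representation, so even a double-precision mean carries rounding error.

```python
# 0.1 cannot be represented exactly in binary floating point, so the
# double-precision mean of three copies of 0.1 is not exactly 0.1:
data = [0.1, 0.1, 0.1]
computed = sum(data) / len(data)
print(repr(computed))   # 0.10000000000000002 on IEEE-754 doubles
print(computed == 0.1)  # False
```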