[Numpy-discussion] sum and mean methods behaviour
verveer at embl.de
Tue Sep 2 13:33:08 CDT 2003
Hi Todd,
> I thought about this a lot yesterday and today talked it over with
> Perry. There are several ways to fix the problem with mean() and
> sum(), and I'm hoping that you and the rest of the community will help
> sort them out.
It was just an innocent question, I did not think it would have such
ramifications :-)
Here are my thoughts:
If I understand you correctly, the sum() and mean() array methods are based on
the reduce method of the universal functions, and these do their calculations
in the precision of the input array. Is that correct?
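To make the pitfall concrete, here is a quick sketch in plain Python (standing
in for the ufunc machinery, so the names and the modulo trick are mine, not
numarray's) of what goes wrong when a sum over UInt8 data is accumulated in
UInt8:

```python
# Summing ten UInt8 values of 200 with a UInt8 accumulator:
# the running total wraps around modulo 256.
values = [200] * 10

acc = 0
for v in values:
    acc = (acc + v) % 256   # simulate an 8-bit accumulator

print(acc)          # 208 -- not the true sum of 2000
print(sum(values))  # 2000, what the user almost certainly wanted
```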
I also gave this some thought, and I would like to make a distinction between
a reduction and the calculation of a statistical value such as the mean or the
sum:
To me, a reduction means the projection of a multi-dimensional array to an
array whose rank is one less than that of the input. The result is still an
array, and often I want the result to have the same precision as the input.
A statistical calculation like a sum or a mean is different: the result should
be correct and the same irrespective of the type of the input, and that
mandates using sufficient precision in the calculation. Note, however, that
such a statistic is a scalar result and does not require temporary storage at
high precision for the whole array.
So keeping this in mind, my comments to your solutions are:
> (1) The first "solution" is to require users to do their own up-casting
> prior to calling mean() or sum(). This gives the end user fine control
> over storage cost but leaves the C-like pitfall/bug you discovered. I
> mention this because this is how the numarray/Numeric reductions are
> designed. Is there a reason why the numarray/Numeric reductions don't
> implicitly up-cast?
For reductions this behaviour suits me, precisely because it allows control
over storage, which is one of the strengths of numarray.
For calculating the mean or the sum of an array this is however an expensive
solution for a very common operation. I do use this solution, but sometimes I
prefer an optimized C routine instead.
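To show why it is expensive, here is a minimal sketch of option (1) in plain
Python (the function name is mine): the caller widens the whole array before
reducing, so a full-size double-precision temporary is materialized.

```python
def mean_with_upcast(values):
    # Explicit up-cast by the caller: correct result, but it pays
    # for a complete widened copy of the input before summing.
    widened = [float(v) for v in values]   # full-size temporary
    return sum(widened) / len(widened)

print(mean_with_upcast([1, 2, 3]))   # 2.0
```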
> (2) The second way is what you proposed: use double precision within
> mean and sum. This has great simplicity but gives no control over
> storage usage, and as implemented, the storage would be much higher than
> one might think, potentially 8x.
I did not mean to suggest storing a cast version of the array before
calculating the mean or the sum. The calculation can be done in double
precision without converting the whole array in memory. I think we can all
agree that storing a full converted copy would not be a good idea.
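In other words, only the accumulator needs to be a double; the input stays in
its own type and is widened element by element as it streams past. A sketch in
plain Python (the name is mine):

```python
def sum_in_double(values):
    # Double-precision accumulator, constant extra storage:
    # the input array is never copied or widened as a whole.
    acc = 0.0
    for v in values:
        acc += v            # each element is widened on the fly
    return acc

print(sum_in_double([200] * 10))   # 2000.0 -- no wraparound
```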
> (3) Lastly, Perry suggested a more radical approach: rather than
> changing the mean and sum methods themselves, we could alter the
> universal function accumulate and reduce methods to implicitly use
> additional precision. Perry's idea was to make all accumulations and
> reductions up-cast their results to the largest type of the current
> family, either Bool, Int64, Float64, or Complex64. By doing this, we
> can improve the utility of the reductions and accumulations as well as
> fixing the problem with sum and mean.
I think that is a great idea in principle, but I think you should consider
this carefully:
First of all, control of the storage cost is lost when you do a reduction. I
would not always find that desirable. Thus, I would like the old behaviour for
reductions to remain accessible, either as a different method or by setting an
optional argument.
Additionally, it would not work well for some operations. For instance,
precise calculation of the mean requires floating-point precision, which an
up-cast within the integer family does not provide. Maybe this can be solved,
but it would require different casting behaviour for different operations.
That might be too much trouble...
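To illustrate the point about the mean: even if integer input is accumulated
exactly in a wide integer type, the final division must still happen in
floating point, or the fractional part of the mean is lost. A tiny example:

```python
values = [1, 2, 2]
total = sum(values)              # exact in a wide integer type

print(total // len(values))     # 1 -- integer division loses the mean
print(total / len(values))      # 1.666... -- floating point is needed
```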
I would like to propose a fourth option:
(4) provide separate implementations for array methods like sum() and mean()
that only calculate the scalar result. No additional storage would be
necessary and the calculation can be done in double precision.
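A sketch of what such a dedicated routine might look like, in plain Python
standing in for the eventual C implementation (the name is mine): one pass
over the data, a double-precision accumulator, and a scalar result.

```python
def scalar_mean(array):
    # One pass, double-precision accumulator, scalar result:
    # no widened temporary of the whole input is ever created.
    acc = 0.0
    n = 0
    for v in array:
        acc += v
        n += 1
    return acc / n

print(scalar_mean([1, 2, 3]))   # 2.0
```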
I guess that the disadvantage is that one cannot leverage the existing code in
the ufuncs so easily. I also assume that it would not be as general a solution
as changing the reduce method is.
I do have some experience writing this sort of code for multidimensional
arrays in C and would be happy to contribute. However, I am not too familiar
with the internals of the numarray library, and I don't know how well my code
would fit in there (although I interface all my code to numarray). But I am
happy to help if I can; numarray is great stuff, and it has become the main
tool for my numerical work.
Peter
--
Dr. Peter J. Verveer
Cell Biology and Cell Biophysics Programme
EMBL
Meyerhofstrasse 1
D-69117 Heidelberg
Germany
Tel. : +49 6221 387245
Fax : +49 6221 387242