[Numpy-discussion] Scalar casting rules use-case reprise

Nathaniel Smith njs@pobox....
Fri Jan 4 07:46:40 CST 2013


On Fri, Jan 4, 2013 at 11:09 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
> Hi,
>
> Reading the discussion on the scalar casting rule change I realized I
> was hazy on the use-cases that led to the rule that scalars cast
> differently from arrays.
>
> My impression was that the primary use-case was for lower-precision
> floats. That is, when you have a large float32 arr, you do not want to
> double your memory use with:
>
>>>> large_float32 + 1.0 # please no float64 here
>
> Probably also:
>
>>>> large_int8 + 1 # please no int32 / int64 here.
>
> That makes sense.  On the other hand these are more ambiguous:
>
>>>> large_float32 + np.float64(1) # really - you don't want float64?
>
>>>> large_int8 + np.int32(1) # ditto
>
> I wonder whether the main use-case was to deal with the default
> types of Python floats and ints?  That is, I wonder whether it would
> be worth considering (in the distant long term) doing the fancy
> guess-what-you-mean stuff only for Python scalars, on the basis that
> they are of unspecified dtype, and making 0-dimensional numpy
> scalars follow the array casting rules.  As in:
>
>>>> large_float32 + 1.0
> # no upcast - we don't know what float type you meant for the scalar
>>>> large_float32 + np.float64(1)
> # upcast - you clearly meant the scalar to be float64

Hmm, but consider this, which is exactly the operation in your example:

In [9]: a = np.arange(3, dtype=np.float32)

In [10]: a / np.mean(a) # normalize
Out[10]: array([ 0.,  1.,  2.], dtype=float32)

In [11]: type(np.mean(a))
Out[11]: numpy.float64

Obviously the most common situation where the ignore-scalar-width rule
is useful is in avoiding "width contamination" from Python float and
int literals. But you can easily end up with numpy scalars from
indexing, from high-precision operations like np.mean, etc., where you
don't "really mean" to ask for high precision. And at least the rule
is easy to understand: same-kind scalars don't affect precision.
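
Concretely, with the same float32 array a as above, I'd expect the
current value-based rules to give something like:

In [12]: (a + np.float64(1)).dtype   # same-kind scalar: width ignored
Out[12]: dtype('float32')

In [13]: (a + 1j).dtype              # different-kind scalar: the kind does change
Out[13]: dtype('complex64')

In [14]: (a + np.zeros(3, np.float64)).dtype   # array/array: ordinary promotion
Out[14]: dtype('float64')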

...Though arguably the bug here is that np.mean actually returns a
value with higher precision. Interestingly, we seem to have some
special-casing so that if you want to normalize each column of a
matrix, the dtype is again preserved, but for a totally different
reason. In

a = np.arange(4, dtype=np.float32).reshape((2, 2))
a / np.mean(a, axis=0, keepdims=True)

the result has float32 type, even though this is an array/array
operation, not an array/scalar operation. The reason is:

In [32]: np.mean(a).dtype
Out[32]: dtype('float64')

But:

In [33]: np.mean(a, axis=0).dtype
Out[33]: dtype('float32')

In this respect np.var and np.std behave like np.mean, but np.sum
always preserves the input dtype. (Which is curious because np.sum is
just like np.mean in terms of potential loss of precision, right? The
problem in np.mean is the accumulating error over many addition
operations, not the divide-by-n at the end.)
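
For completeness: both np.sum and np.mean take a dtype argument, so if
you do want the wider accumulator you can ask for it explicitly.
Something like:

In [34]: np.sum(a).dtype                     # input dtype preserved
Out[34]: dtype('float32')

In [35]: np.sum(a, dtype=np.float64).dtype   # explicit float64 accumulator
Out[35]: dtype('float64')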

It is very disturbing that even after this discussion none of us here
seem to have a precise understanding of how the numpy type selection
system actually works :-(. We really need a formal description...
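
np.result_type at least exposes the promotion machinery
programmatically, which might be a starting point for pinning such a
description down. Again assuming the current value-based rules, I'd
expect:

In [36]: np.result_type(np.zeros(3, np.float32), np.float64(1))   # scalar: width ignored
Out[36]: dtype('float32')

In [37]: np.result_type(np.zeros(3, np.float32), np.zeros(3, np.float64))   # arrays: ordinary promotion
Out[37]: dtype('float64')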

-n

