[Numpy-discussion] Re: scipy.stats.itemfreq: overflow with add.reduce

Todd Miller jmiller at stsci.edu
Thu Dec 22 08:30:25 CST 2005


Dr. Hans Georg Krauthaeuser wrote:

> Tim Churches wrote:
>
>> Hans Georg Krauthaeuser wrote:
>>
>>> Hans Georg Krauthaeuser wrote:
>>>
>>>
>>>> Hi All,
>>>>
>>>> I was playing with scipy.stats.itemfreq when I observed the 
>>>> following overflow:
>>>>
>>>> In [119]:for i in [254,255,256,257,258]:
>>>>  .....:    l=[0]*i
>>>>  .....:    print i, stats.itemfreq(l), l.count(0)
>>>>  .....:
>>>> 254 [ [  0 254]] 254
>>>> 255 [ [  0 255]] 255
>>>> 256 [ [0 0]] 256
>>>> 257 [ [0 1]] 257
>>>> 258 [ [0 2]] 258
>>>>
>>>> itemfreq is pretty small (in stats.py):
>>>>
>>>> ----------------------------------------------------------------------
>>>> def itemfreq(a):
>>>>   """
>>>> Returns a 2D array of item frequencies.  Column 1 contains item values,
>>>> column 2 contains their respective counts.  Assumes a 1D array is passed.
>>>>
>>>> Returns: a 2D frequency table (col [0:n-1]=scores, col n=frequencies)
>>>> """
>>>>   scores = _support.unique(a)
>>>>   scores = sort(scores)
>>>>   freq = zeros(len(scores))
>>>>   for i in range(len(scores)):
>>>>       freq[i] = add.reduce(equal(a,scores[i]))
>>>>   return array(_support.abut(scores, freq))
>>>> ----------------------------------------------------------------------
>>>>
>>>> It seems that add.reduce is the source of the overflow:
>>>>
>>>> In [116]:from scipy import *
>>>>
>>>> In [117]:for i in [254,255,256,257,258]:
>>>>  .....:    l=[0]*i
>>>>  .....:    print i, add.reduce(equal(l,0))
>>>>  .....:
>>>> 254 254
>>>> 255 255
>>>> 256 0
>>>> 257 1
>>>> 258 2
>>>>
>>>> Is there any way to avoid the overflow?
>>>
>>
>>
>> Apropos the preceding, herewith a thread on the Numpy list from more
>> than a few months ago. The take-home message is that for integer arrays,
>> add.reduce is very fast at producing results which fall into two
>> categories: a) correct or b) incorrect due to overflow. Unfortunately
>> there is no equally quick method of determining into which of these two
>> categories any specific result returned by add.reduce falls.
>>
>> Tim C
>
Hi Tim,

There are already tools in place which let you accomplish what you
want; they're just not the default.  I have a few suggestions:

1.  Ask scipy newcore (Travis) for long-term changes like the default
behavior of add.reduce.

2.  Look at numarray's sum() method for better overflow behavior and 
reduction on all axes.

 >>> import numarray as na
 >>> a = na.arange(100, type='Int8')
 >>> na.add.reduce(a)   # yeah, it overflows...
86
 >>> # sum() doesn't overflow (well, not if Int64 is enough) and always
 >>> # reduces on all axes at once...
 >>> a.sum()
4950L

3. Mimic the sum() method and specify the type of your output array in
add.reduce.

 >>> na.add.reduce(a, type=na.MaximumType(a.type()))
4950L

4. Or, just explicitly specify a reduction type you're happy with:

 >>> na.add.reduce(a, type='Int16')
4950
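
For instance, applied back to the itemfreq case that started this thread,
here is a rough sketch of the same approach (untested, and assuming
numarray is an option for you):

 >>> import numarray as na
 >>> l = na.array([0]*258)                         # 258 zeros, like the failing case
 >>> na.add.reduce(na.equal(l, 0), type='Int32')   # count matches in a wide-enough type
258

That's the same idea sum() applies automatically: make the accumulator
wide enough for the count before reducing.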

Hopefully one of these solutions will work for you.

Regards,
Todd

>>
>> From: Tim Churches <tchur at optushome.com.au>
>> To: Todd Miller <jmiller at stsci.edu>
>> Date: Apr 12 2005 - 7:51am
>>
>> Todd Miller wrote:
>>
>>> On Sun, 2005-04-10 at 10:23 +1000, Tim Churches wrote:
>>>
>>>
>>>> I just got caught by code equivalent to this (with NumPy 23.8 on 32-bit
>>>> Linux):
>>>>
>>>>
>>>> >>> import Numeric as N
>>>> >>> a = N.array((2000000000,1000000000),typecode=N.Int32)
>>>> >>> N.add.reduce(a)
>>>> -1294967296
>>>>
>>>> OK, it is an elementary mistake, but the silent overflow caught me
>>>> unawares. Casting the array to Float64 before summing it avoids the
>>>> error, but in my instance the actual data is a rank-1 array of 21
>>>> million integers with a mean value of about 140 (which adds up to more
>>>> than sys.maxint), and casting to Float64 will use quite a lot of memory (as
>>>> well as taking some time).
>>>>
>>>> Any advice for catching or avoiding such overflow without incurring a
>>>> performance and memory hit by always casting to Float64?
>>>
>>>
>>>
>>> Here's what numarray does:
>>>
>>>
>>>
>>> >>> import numarray as N
>>> >>> a = N.array((2000000000,1000000000),typecode=N.Int32)
>>> >>> N.add.reduce(a)
>>> -1294967296
>>>
>>> So basic reductions in numarray have the same "careful while you're
>>> shaving" behavior as Numeric; it's fast but easy to screw up.
>>
>>
>>
>> Sure, but how does one be careful? It seems that for any array of two
>> integers or more which could sum to more than sys.maxint or less than
>> -sys.maxint, add.reduce() in both NumPy and Numeric will give either a)
>> the correct answer or b) the incorrect answer, and short of adding up
>> the array using a safer but much slower method, there is no way of
>> determining if the answer provided (quickly) by add.reduce is right or
>> wrong? Which seems to make it fast but useless (for integer arrays, at
>> least)? Is that an unfair summary? Can anyone point me towards a method
>> for using add.reduce() on small arrays of large integers with values in
>> the billions, or on large arrays of fairly small integer values, which
>> will not suddenly and without warning give the wrong answer?
>>
>>
>>> But:
>>>
>>>
>>>
>>> >>> a.sum()
>>> 3000000000L
>>>
>>> >>> a.sum(type='d')
>>> 3000000000.0
>>>
>>> a.sum() blockwise upcasts to the largest type of its kind on the fly, in
>>> this case, Int64. This avoids the storage overhead of typecasting the
>>> entire array.
>>
>>
>>
>> That's on a 64-bit platform, right? The same method could be used to
>> cast the accumulator to a Float64 on a 32-bit platform to avoid casting
>> the entire array?
>>
>>
>>> A better name for the method would have been sumall() since it sums all
>>> elements of a multi-dimensional array. The flattening process reduces
>>> on one dimension before flattening, preventing a full copy of a
>>> discontiguous array. It could be smarter about choosing the dimension
>>> of the initial reduction.
>>
>>
>>
>> OK, thanks. Unfortunately it is not possible for us to port our
>> application to numarray at the moment. But the insight is most helpful.
>>
>> Tim C
>
> Tim,
>
> Thank you very much for your answer. I posted two follow-ups to my own 
> post on c.l.p 
> (http://groups.google.de/group/comp.lang.python/browse_thread/thread/a96c404d73d71259/7769fca1fa562718?hl=de#7769fca1fa562718) 
>
>
> At least for scipy version '0.3.2', the problem comes directly from the 
> ufunc 'add' for bool arrays:
>
> In [178]:array(array([1],'b')*2)
> Out[178]:array([2],'i')
>
> In [179]:array(array([1],'b')+array([1],'b'))
> Out[179]:array([2],'b')
>
> 'multiply' changes the typecode; 'add' does not.
>
> So, in this case
>
> a+a != a*2 if a is an array with typecode 'b'.
>
> Regards
> Hans Georg Krauthäuser
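
Regarding the bool 'add' behavior noted above: with the Numeric-based
scipy, one possible workaround (a rough sketch, not tested against 0.3.2)
is to upcast the comparison result yourself before reducing, at the cost
of one temporary copy:

 >>> from scipy import *
 >>> l = [0]*258
 >>> add.reduce(equal(l, 0).astype('l'))   # upcast to a C long before summing
258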






More information about the Numpy-discussion mailing list