[Numpy-discussion] Re: scipy.stats.itemfreq: overflow with add.reduce
Dr. Hans Georg Krauthaeuser
Hans-Georg.Krauthaeuser at E-Technik.Uni-Magdeburg.DE
Wed Dec 21 23:54:04 CST 2005
Tim Churches wrote:
> Hans Georg Krauthaeuser wrote:
>
>>Hans Georg Krauthaeuser wrote:
>>
>>
>>>Hi All,
>>>
>>>I was playing with scipy.stats.itemfreq when I observed the following
>>>overflow:
>>>
>>>In [119]:for i in [254,255,256,257,258]:
>>> .....: l=[0]*i
>>> .....: print i, stats.itemfreq(l), l.count(0)
>>> .....:
>>>254 [ [ 0 254]] 254
>>>255 [ [ 0 255]] 255
>>>256 [ [0 0]] 256
>>>257 [ [0 1]] 257
>>>258 [ [0 2]] 258
>>>
>>>itemfreq is pretty small (in stats.py):
>>>
>>>----------------------------------------------------------------------
>>>def itemfreq(a):
>>> """
>>>Returns a 2D array of item frequencies. Column 1 contains item values,
>>>column 2 contains their respective counts. Assumes a 1D array is passed.
>>>
>>>Returns: a 2D frequency table (col [0:n-1]=scores, col n=frequencies)
>>>"""
>>> scores = _support.unique(a)
>>> scores = sort(scores)
>>> freq = zeros(len(scores))
>>> for i in range(len(scores)):
>>> freq[i] = add.reduce(equal(a,scores[i]))
>>> return array(_support.abut(scores, freq))
>>>----------------------------------------------------------------------
>>>
>>>It seems that add.reduce is the source for the overflow:
>>>
>>>In [116]:from scipy import *
>>>
>>>In [117]:for i in [254,255,256,257,258]:
>>> .....: l=[0]*i
>>> .....: print i, add.reduce(equal(l,0))
>>> .....:
>>>254 254
>>>255 255
>>>256 0
>>>257 1
>>>258 2
>>>
>>>Is there any possibility to avoid the overflow?
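[Editorial note: the wrap-around above can be reproduced in current NumPy by forcing an unsigned-byte accumulator (Numeric's 'b' typecode was an unsigned byte). This is a sketch of the failure mode, not the original scipy 0.3.2 code path:]

```python
import numpy as np

# Reproduce the modulo-256 wrap-around with an explicit uint8
# accumulator; scipy 0.3.2 hit this implicitly inside add.reduce.
wrapped = []
for i in (254, 255, 256, 257, 258):
    l = [0] * i
    wrapped.append(int(np.add.reduce(np.equal(l, 0), dtype=np.uint8)))
# wrapped == [254, 255, 0, 1, 2], mirroring the counts above
```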
>
>
> Apropos the preceding, herewith a thread on the Numpy list from more
> than a few months ago. The take-home message is that for integer arrays,
> add.reduce is very fast at producing results which fall into two
> categories: a) correct or b) incorrect due to overflow. Unfortunately
> there is no equally quick method of determining into which of these two
> categories any specific result returned by add.reduce falls.
>
> Tim C
>
> From: Tim Churches <tchur at optushome.com.au>
> To: Todd Miller <jmiller at stsci.edu>
> Date: Apr 12 2005 - 7:51am
>
> Todd Miller wrote:
>
>>On Sun, 2005-04-10 at 10:23 +1000, Tim Churches wrote:
>>
>>
>>>I just got caught by code equivalent to this (with NumPy 23.8 on 32 bit
>>>Linux):
>>>
>>>
>>>>>>import Numeric as N
>>>>>>a = N.array((2000000000,1000000000),typecode=N.Int32)
>>>>>>N.add.reduce(a)
>>>
>>>-1294967296
>>>
>>>OK, it is an elementary mistake, but the silent overflow caught me
>>>unawares. Casting the array to Float64 before summing it avoids the
>>>error, but in my instance the actual data is a rank-1 array of 21
>>>million integers with a mean value of about 140 (which adds up to more than
>>>sys.maxint), and casting to Float64 will use quite a lot of memory (as
>>>well as taking some time).
>>>
>>>Any advice for catching or avoiding such overflow without always
>>>incurring a performance and memory hit by always casting to Float64?
>>
>>
>>Here's what numarray does:
>>
>>
>>
>>>>>import numarray as N
>>>>>a = N.array((2000000000,1000000000),typecode=N.Int32)
>>>>>N.add.reduce(a)
>>
>>-1294967296
>>
>>So basic reductions in numarray have the same "careful while you're
>>shaving" behavior as Numeric; it's fast but easy to screw up.
>
>
> Sure, but how does one be careful? It seems that for any array of two
> or more integers which could sum to more than sys.maxint or less than
> -sys.maxint, add.reduce() in both NumPy and Numeric will give either a)
> the correct answer or b) the incorrect answer, and short of adding up
> the array using a safer but much slower method, there is no way of
> determining whether the answer provided (quickly) by add.reduce is right
> or wrong. That seems to make it fast but useless, for integer arrays at
> least. Is that an unfair summary? Can anyone point me towards a method
> for using add.reduce() on small arrays of large integers with values in
> the billions, or on large arrays of fairly small integer values, which
> will not suddenly and without warning give the wrong answer?
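[Editorial note: one workable answer in today's NumPy, sketched here with a made-up helper name rather than anything the 2005 APIs offered: reduce with an explicitly wider accumulator, then range-check the result. The int64 accumulator can itself overflow for sums beyond 2**63, but it avoids copying the whole array:]

```python
import numpy as np

# Hypothetical helper (not from the thread): sum an int32 array with a
# 64-bit accumulator, then verify the total still fits in int32.
def checked_sum_int32(a):
    # dtype=np.int64 widens only the accumulator; the array is not copied
    total = int(np.add.reduce(a, dtype=np.int64))
    info = np.iinfo(np.int32)
    if not (info.min <= total <= info.max):
        raise OverflowError(f"sum {total} does not fit in int32")
    return total

big = np.array([2_000_000_000, 1_000_000_000], dtype=np.int32)
small = np.full(21, 140, dtype=np.int32)
# checked_sum_int32(big) raises OverflowError instead of silently wrapping;
# checked_sum_int32(small) returns 2940
```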
>
>
>>But:
>>
>>
>>
>>>>>a.sum()
>>
>>3000000000L
>>
>>
>>>>>a.sum(type='d')
>>
>>3000000000.0
>>
>>a.sum() blockwise upcasts to the largest type of kind on the fly, in
>>this case, Int64. This avoids the storage overhead of typecasting the
>>entire array.
>
>
> That's on a 64-bit platform, right? The same method could be used to
> cast the accumulator to a Float64 on a 32-bit platform to avoid casting
> the entire array?
>
>
>>A better name for the method would have been sumall() since it sums all
>>elements of a multi-dimensional array. The flattening process reduces
>>on one dimension before flattening, preventing a full copy of a
>>discontiguous array. It could be smarter about choosing the dimension
>>of the initial reduction.
>
>
> OK, thanks. Unfortunately it is not possible for us to port our
> application to numarray at the moment. But the insight is most helpful.
>
> Tim C
Tim,
Thank you very much for your answer. I posted two follow-ups to my own
post on c.l.p
(http://groups.google.de/group/comp.lang.python/browse_thread/thread/a96c404d73d71259/7769fca1fa562718?hl=de#7769fca1fa562718)
At least for scipy version '0.3.2', the problem comes directly from the
ufunc 'add' for bool arrays:
In [178]:array(array([1],'b')*2)
Out[178]:array([2],'i')
In [179]:array(array([1],'b')+array([1],'b'))
Out[179]:array([2],'b')
'multiply' changes the typecode, but 'add' does not. So, in this case,
a+a != a*2 if a is an array with typecode 'b'.
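[Editorial note: in current NumPy the itemfreq-style count is spelled np.unique with return_counts (scipy.stats.itemfreq was later deprecated in its favour). The counts come back as platform-sized integers, so the byte-typed accumulator never enters the picture:]

```python
import numpy as np

# Counting via np.unique avoids any narrow accumulator; counts are a
# platform-sized integer array, so nothing wraps at 256.
for i in (254, 255, 256, 257, 258):
    values, counts = np.unique(np.zeros(i, dtype=np.uint8), return_counts=True)
    assert counts[0] == i  # matches l.count(0) in the session above
```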
Regards
Hans Georg Krauthäuser
--
Otto-von-Guericke-Universitaet Magdeburg
IGET | phone: +49 391 67 12195
Postfach 4120 | fax: +49 391 67 11236
39016 Magdeburg | email: hgk at et.Uni-Magdeburg.DE
Germany | www: www.uni-magdeburg.de/krauthae