[Numpy-discussion] Ticket #605 Incorrect behavior of numpy.histogram

Bruce Southey bsouthey@gmail....
Wed Apr 9 09:50:35 CDT 2008


Hi,
I should have asked first (I hope that you don't mind), but I created 
Ticket #728 (http://scipy.org/scipy/numpy/ticket/728 ) for 
numpy.r_ because it casts incorrectly based on the array types.

The bug is that -inf and inf are numpy floats but dbin is an array of 
ints. Unfortunately, numpy.r_ takes its result type from the array 
arguments (among arrays, 'float' wins over 'int'), not from the scalars, 
so here the result is an int array and -inf and inf are lost. The fix is 
as you indicated below.
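
A minimal sketch of that workaround, assuming the fix amounts to converting the bin edges to float before concatenating them with the infinite end points. The name `dbin` follows the example quoted below; `edges` is just an illustrative name.

```python
import numpy as np

dbin = np.asarray([1, 2, 3])      # integer bin edges, as in the report

# Cast to float first so that -inf and inf have a representation in the
# result; an integer array cannot hold them.
edges = np.r_[-np.inf, dbin.astype(float), np.inf]

print(edges.dtype)
print(edges)
```

The explicit astype(float) makes the result type independent of how r_ combines scalars and arrays.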

Regards
Bruce




David Huard wrote:
>
>
> 2008/4/8, Bruce Southey <bsouthey@gmail.com>:
>
>     Hi,
>     I agree that the current histogram should be changed. However, I am
>     not sure 1.0.5 is the correct release for that.
>
>
> We both agree. 
>
>     David, this doesn't work for your code:
>     r= np.array([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5])
>     dbin=[2,3,4]
>     rc, rb=histogram(r, bins=dbin, discard=None)
>
>     Returns:
>     rc=[3 3] # Really should be [3, 3, 9]
>     rb=[-9223372036854775808                    3 -9223372036854775808]
>
>
> I used the convention that bins are the bin edges, including the
> rightmost edge; this is why len(rc) = 2 and len(rb) = 3.
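
For illustration, a small sketch of that convention written with np.searchsorted rather than with histogram itself: n edges give n-1 bins, the last bin includes its right edge, and values outside the edges are dropped. The helper name `count_between_edges` is made up for this example and is not part of numpy.

```python
import numpy as np

def count_between_edges(values, edges):
    """Count values per bin; `edges` holds all bin edges, rightmost included."""
    values = np.asarray(values, dtype=float)
    edges = np.asarray(edges, dtype=float)
    # Keep only values inside [edges[0], edges[-1]]; outliers are not tallied.
    inside = values[(values >= edges[0]) & (values <= edges[-1])]
    # Index of the bin whose left edge is <= value.
    idx = np.searchsorted(edges, inside, side="right") - 1
    # Values equal to the rightmost edge belong to the last bin.
    idx[idx == len(edges) - 1] = len(edges) - 2
    return np.bincount(idx, minlength=len(edges) - 1)

r = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5])
counts = count_between_edges(r, [2, 3, 4])
print(len(counts), counts)
```

With three edges the result has two counts, which is the length relation described above.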
>
> Now there clearly is a bug, and I traced it to the use of np.r_. Check 
> this out:
>
> In [26]: dbin = [1,2,3]
>
> In [27]: np.r_[-np.inf, dbin, np.inf]
> Out[27]: array([-Inf,   1.,   2.,   3.,  Inf])
>
> In [28]: np.r_[-np.inf, asarray(dbin), np.inf]
> Out[28]:
> array([-9223372036854775808,                    1,                    2,
>                           3, -9223372036854775808])
>
> In [29]: np.r_[-np.inf, asarray(dbin).astype(float), np.inf]
> Out[29]: array([-Inf,   1.,   2.,   3.,  Inf])
>
> Is this a misuse of r_ or a bug?
>
>
> David
>
>     But I have not had time to find the error.
>
>     Regards
>     Bruce
>
>
>
>     David Huard wrote:
>     > Hans,
>     >
>     > Note that the current histogram is buggy, in the sense that it assumes
>     > that all bins have the same width and computes db = bins[1]-bins[0].
>     > This is why you get zeros everywhere.
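
To make the width issue concrete, here is a hedged sketch (not the actual histogram source) comparing a normalization that reuses the first bin width for every bin with one that uses each bin's own width. The edges and counts are made-up illustration values.

```python
import numpy as np

edges = np.array([0.0, 1.0, 3.0, 7.0])   # unequal bin widths: 1, 2, 4
counts = np.array([2, 4, 4])             # made-up tallies, one per bin
n = counts.sum()

# Equal-width assumption: the first bin's width is reused for every bin.
db = edges[1] - edges[0]
density_uniform = counts / (n * db)

# Per-bin normalization: each count is divided by its own bin width.
widths = np.diff(edges)
density_per_bin = counts / (n * widths)

# Integral of each density over the binned range.
area_uniform = (density_uniform * widths).sum()
area_per_bin = (density_per_bin * widths).sum()
print(area_uniform, area_per_bin)

# With -inf as the first edge, db is inf and every density becomes zero.
db_inf = 2.0 - (-np.inf)
density_inf = counts / (n * db_inf)
print(density_inf)
```

Only the per-bin version integrates to one when the widths differ, and a single infinite width zeroes every bin under the equal-width assumption.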
>     >
>     > The current behavior has been heavily criticized and I think we should
>     > change it. My proposal is to have for histogram the same behavior as
>     > for histogramdd and histogram2d: bins are the bin edges, including the
>     > rightmost edge, and values outside of the bins are not tallied. The
>     > problem with this is that it breaks code, and I'm not sure it's such a
>     > good idea to do this in a point release.
>     >
>     > My short term proposal would be to fix the normalization bug and
>     > document the current behavior of histogram for the 1.0.5 release. Once
>     > that's done, we can modify histogram and maybe print a warning the first
>     > time it's used to notify users of the change.
>     >
>     > I'd like to hear the voice of experienced devs on this. This issue has
>     > been raised a number of times since I started following this ML. It's
>     > not the first time I've proposed patches, and I've already documented
>     > the weird behavior only to see the comments disappear after a while. I
>     > hope this time some kind of agreement will be reached.
>     >
>     > Regards,
>     >
>     > David
>     >
>     >
>     >
>     >
>     > 2008/4/8, Hans Meine <meine@informatik.uni-hamburg.de>:
>     >
>     >     Am Montag, 07. April 2008 14:34:08 schrieb Hans Meine:
>     >
>     >     > Am Samstag, 05. April 2008 21:54:27 schrieb Anne Archibald:
>     >     > > There's also a fourth option - raise an exception if any points
>     >     > > are outside the range.
>     >     >
>     >     > +1
>     >     >
>     >     > I think this should be the default.  Otherwise, I tend towards
>     >     > "exclude", in order to have comparable bin sizes (when plotting, I
>     >     > always find peaks at the ends annoying); this could also be called
>     >     > "clip" BTW.
>     >     >
>     >     > But really, an exception would follow the Zen: "In the face of
>     >     > ambiguity, refuse the temptation to guess."  And with a kwarg:
>     >     > "Explicit is better than implicit."
>     >
>     >
>     >     When posting this, I did indeed not think this through fully; as
>     >     David (and Tommy) pointed out, this API does not fit well with the
>     >     existing `bins` option, especially when a sequence of bin bounds is
>     >     given.  (I guess I was mostly thinking about the special case of
>     >     discrete values and 1:1 bins, as typical for uint8 data.)
>     >
>     >     Thus, I would like to withdraw my above opinion and instead state
>     >     that I find the current API as clear as it gets.  If you want to
>     >     exclude values, simply pass an additional right bound, and for
>     >     including outliers, passing -inf as an additional left bound seems
>     >     to do the trick.  This could possibly be added to the documentation
>     >     though.
>     >
>     >     The only critical aspect I see is the `normed` arg.  As it is now,
>     >     the rightmost bin always has infinite size, but it is not treated
>     >     like that:
>     >
>     >     In [1]: from numpy import *
>     >
>     >     In [2]: histogram(arange(10), [2,3,4], normed = True)
>     >     Out[2]: (array([ 0.1,  0.1,  0.6]), array([2, 3, 4]))
>     >
>     >     Even worse, if you try to add an infinite bin to the left, this
>     >     pulls all values to zero (technically, I understand that, but it
>     >     looks really undesirable to me):
>     >
>     >     In [3]: histogram(arange(10), [-inf, 2,3,4], normed = True)
>     >     Out[3]: (array([ 0.,  0.,  0.,  0.]), array([-Inf,   2.,   3.,   4.]))
>     >
>     >
>     >     --
>     >     Ciao, /  /
>     >          /--/
>     >         /  / ANS
>     >     _______________________________________________
>     >     Numpy-discussion mailing list
>     >     Numpy-discussion@scipy.org
>     >     http://projects.scipy.org/mailman/listinfo/numpy-discussion
>     >
>     >


