[Numpy-discussion] Ticket #605 Incorrect behavior of numpy.histogram

David Huard david.huard@gmail....
Tue Apr 8 14:25:00 CDT 2008


2008/4/8, Bruce Southey <bsouthey@gmail.com>:
>
> Hi,
> I agree that the current histogram should be changed. However, I am not
> sure 1.0.5 is the correct release for that.


We both agree.

> David, this doesn't work for your code:
> r= np.array([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5])
> dbin=[2,3,4]
> rc, rb=histogram(r, bins=dbin, discard=None)
>
> Returns:
> rc=[3 3] # Really should be [3, 3, 9]
> rb=[-9223372036854775808                    3 -9223372036854775808]


I used the convention that bins are the bin edges, including the rightmost
edge; this is why len(rc) = 2 and len(rb) = 3.
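
As a rough illustration of the edge convention described above, the sketch below assumes a NumPy release in which histogram already treats `bins` as edges and does not tally values outside them (the `discard` keyword from the patch under discussion is not used here). Three edges give two counts:

```python
import numpy as np

r = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5])
edges = [2, 3, 4]  # three edges define two bins

# Assumption: np.histogram interprets `bins` as bin edges, so the
# number of counts is one less than the number of edges.
counts, returned_edges = np.histogram(r, bins=edges)

print(len(counts), len(returned_edges))
print(counts.sum())  # number of samples that fell inside the edges
```

Under that assumption, the 1 and the five 5s fall outside the edges and are simply left out of the counts.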

Now there clearly is a bug, and I traced it to the use of np.r_. Check this
out:

In [26]: dbin = [1,2,3]

In [27]: np.r_[-np.inf, dbin, np.inf]
Out[27]: array([-Inf,   1.,   2.,   3.,  Inf])

In [28]: np.r_[-np.inf, asarray(dbin), np.inf]
Out[28]:
array([-9223372036854775808,                    1,
                          2,                    3, -9223372036854775808])

In [29]: np.r_[-np.inf, asarray(dbin).astype(float), np.inf]
Out[29]: array([-Inf,   1.,   2.,   3.,  Inf])

Is this a misuse of r_ or a bug?
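
Whichever it turns out to be, one way to sidestep the problem is to cast the edges to float before adding the infinite bounds, as In [29] does. A minimal sketch, where `padded_edges` is a hypothetical helper name and not part of NumPy or of the patch:

```python
import numpy as np

def padded_edges(dbin):
    """Hypothetical helper: return dbin with -inf and +inf as outer edges."""
    # Cast to float first so that inf is representable in the result;
    # an integer array has no way to store it.
    inner = np.asarray(dbin, dtype=float)
    return np.concatenate(([-np.inf], inner, [np.inf]))

edges = padded_edges([1, 2, 3])
print(edges.dtype, edges)
```

Using np.concatenate with an explicit float cast avoids relying on how r_ picks the output dtype.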


David

> But I have not had time to find the error.
>
> Regards
> Bruce
>
>
>
> David Huard wrote:
> > Hans,
> >
> > Note that the current histogram is buggy, in the sense that it assumes
> > that all bins have the same width and computes db = bins[1]-bins[0].
> > This is why you get zeros everywhere.
> >
> > The current behavior has been heavily criticized and I think we should
> > change it. My proposal is to have for histogram the same behavior as
> > for histogramdd and histogram2d: bins are the bin edges, including the
> > rightmost bin, and values outside of the bins are not tallied. The
> > problem with this is that it breaks code, and I'm not sure it's such a
> > good idea to do this in a point release.
> >
> > My short term proposal would be to fix the normalization bug and
> > document the current behavior of histogram for the 1.0.5 release. Once
> > it's done, we can modify histogram and maybe print a warning the first
> > time it's used to notify users of the change.
> >
> > I'd like to hear the voice of experienced devs on this. This issue has
> > been raised a number of times since I started following this ML. It's not the
> > first time I've proposed patches, and I've already documented the
> > weird behavior only to see the comments disappear after a while. I
> > hope this time some kind of agreement will be reached.
> >
> > Regards,
> >
> > David
> >
> >
> >
> >
> > 2008/4/8, Hans Meine <meine@informatik.uni-hamburg.de>:
> >
> >     On Monday, 07 April 2008 14:34:08, Hans Meine wrote:
> >
> >     > On Saturday, 05 April 2008 21:54:27, Anne Archibald wrote:
> >     > > There's also a fourth option - raise an exception if any points are
> >     > > outside the range.
> >     >
> >     > +1
> >     >
> >     > I think this should be the default.  Otherwise, I tend towards "exclude",
> >     > in order to have comparable bin sizes (when plotting, I always find peaks
> >     > at the ends annoying); this could also be called "clip" BTW.
> >     >
> >     > But really, an exception would follow the Zen: "In the face of ambiguity,
> >     > refuse the temptation to guess."  And with a kwarg: "Explicit is better
> >     > than implicit."
> >
> >
> >     When posting this, I did indeed not think this through fully; as David (and
> >     Tommy) pointed out, this API does not fit well with the existing `bins`
> >     option, especially when a sequence of bin bounds is given.  (I guess I was
> >     mostly thinking about the special case of discrete values and 1:1 bins, as
> >     typical for uint8 data.)
> >
> >     Thus, I would like to withdraw my above opinion and instead state that I
> >     find the current API as clear as it gets.  If you want to exclude values,
> >     simply pass an additional right bound, and for including outliers,
> >     passing -inf as additional left bound seems to do the trick.  This could
> >     possibly be added to the documentation though.
> >
> >     The only critical aspect I see is the `normed` arg.  As it is now, the
> >     rightmost bin always has infinite size, but it is not treated like that:
> >
> >     In [1]: from numpy import *
> >
> >     In [2]: histogram(arange(10), [2,3,4], normed = True)
> >     Out[2]: (array([ 0.1,  0.1,  0.6]), array([2, 3, 4]))
> >
> >     Even worse, if you try to add an infinite bin to the left, this pulls all
> >     values to zero (technically, I understand that, but it looks really
> >     undesirable to me):
> >
> >     In [3]: histogram(arange(10), [-inf, 2,3,4], normed = True)
> >     Out[3]: (array([ 0.,  0.,  0.,  0.]), array([-Inf,   2.,   3.,   4.]))
> >
> >
> >     --
> >     Ciao, /  /
> >          /--/
> >         /  / ANS
> >     _______________________________________________
> >     Numpy-discussion mailing list
> >     Numpy-discussion@scipy.org
> >     http://projects.scipy.org/mailman/listinfo/numpy-discussion
> >
> >