[Numpy-discussion] numpy histogram normed=True (bug / confusing behavior)

David Huard david.huard@gmail....
Mon Aug 30 14:44:17 CDT 2010


On Mon, Aug 30, 2010 at 3:02 PM, <josef.pktd@gmail.com> wrote:

> On Mon, Aug 30, 2010 at 2:43 PM, Benjamin Root <ben.root@ou.edu> wrote:
> > On Mon, Aug 30, 2010 at 10:50 AM, <josef.pktd@gmail.com> wrote:
> >>
> >> On Mon, Aug 30, 2010 at 11:39 AM, Bruce Southey <bsouthey@gmail.com>
> >> wrote:
> >> > On 08/30/2010 09:19 AM, Benjamin Root wrote:
> >> >
> >> > On Mon, Aug 30, 2010 at 8:29 AM, David Huard <david.huard@gmail.com>
> >> > wrote:
> >> >>
> >> >> Thanks for the feedback,
> >> >> As far as I understand it, the proposal is to keep histogram as it
> >> >> is for 1.5, then in 2.0, deprecate normed=True but keep the buggy
> >> >> behavior, while adding a density keyword that fixes the bug. In a
> >> >> later release, we could then get rid of normed. While the bug won't
> >> >> be present in histogramdd and histogram2d, the keyword change should
> >> >> be mirrored in those functions as well.
> >> >> I personally am not too keen on changing the keyword normed for
> >> >> density. I feel we are trading clarity for a few new users against
> >> >> additional trouble for many existing users. We could mitigate this
> >> >> by first documenting the change in the docstring and living with
> >> >> both keywords for a few years before raising a DeprecationWarning.
> >> >> Since this has a direct impact on matplotlib's hist, I'd be keen to
> >> >> hear from the devs on this.
> >> >> David
> >> >
> >> > I am not a dev, but I would like to give a word of warning from
> >> > matplotlib.
> >> >
> >> > In matplotlib, the bar/hist family of functions grew organically as
> >> > the devs took on various requests to add keywords and such to modify
> >> > the style and behavior of those graphing functions.  It has now
> >> > become an unmaintainable mess, prompting discussions on how to rip it
> >> > out and replace it with a cleaner implementation.  While everyone
> >> > agrees that it needs to be done, none of us wants to break backwards
> >> > compatibility.
> >> >
> >> > My personal feeling is that a function should do one thing, and do
> >> > that one thing well.  So, to me, that means that histogram() should
> >> > return an array of counts and the bins for those counts.  Anything
> >> > more is merely window dressing to me.  With this information, one can
> >> > easily compute a cumulative distribution function, and/or normalize
> >> > the result.  The idea is that if there is nothing special that needs
> >> > to be done within the histogram algorithm to accommodate these extra
> >> > features, then they belong outside the function.
> >> >
> >> > My 2 cents,
> >> > Ben Root
> >> >
> >> > _______________________________________________
> >> > NumPy-Discussion mailing list
> >> > NumPy-Discussion@scipy.org
> >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >> >
> >> > +1 for Ben's approach.
> >> > This is very similar to my view regarding the contingency table
> >> > class proposed for scipy
> >> > ( http://projects.scipy.org/scipy/ticket/1258). We need to provide
> >> > the core functionality that other approaches, such as density
> >> > estimation, can use, without being limited to specific details.
> >>
> >> I think (a corrected) density histogram is core functionality for
> >> unequal bin lengths.
> >>
> >> The graph with raw count in the case of unequal bin sizes would be
> >> quite misleading when plotted and interpreted on the real line and not
> >> on discrete points (shaded areas instead of vertical lines). And as
> >> the origin of this thread showed, it's not trivial to figure out what
> >> the correct normalization is.
> >> So, I think, if we drop the density normalization, we just need a new
> >> function that does it.
> >>
> >> My 2c,
> >>
> >> Josef
> >>
> >>
> >
> > Why not a function that takes the output of a core histogram and
> > produces a correct density normalization?  Such a function would be
> > useful elsewhere, I imagine.
> >
> > Of course there are a lot of legacy issues to consider, but if we
> > introduce such a function first with documentation in histogram()
> > showing how to produce a normalized density, we can then keep some of
> > the bad code for now for backwards compatibility with notes saying that
> > some of the stuff will be deprecated.  Especially point out in the docs
> > where the current code fails to produce the correct results.
>
> bugfix or redesign ?
>
> My feature request for (or target for forking) the histogram functions
> is to get the temporary results out, or get additional results, for
> example the bin-number or quantization for each observation, or some
> other things that I don't remember right now.
>
> With histogram functions that only do histograms, we lose a lot of
> calculations. This is, however, not really relevant for calculating
> densities since the bin edges are returned.
>
>
Not sure I'm understanding what you mean by this, but if you look at the
code, you'll see that histogram is basically a big wrapper around a
one-liner: np.diff(np.searchsorted(np.sort(data), bins)). Most of the code
is there to make this one-liner user-friendly, improve performance or handle
weights.
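
A minimal sketch of that one-liner, for illustration only. The helper name
histogram_sketch is hypothetical, and treating the last edge as inclusive is
an assumption made here so that the result can be compared with np.histogram;
this is not the actual numpy source.

```python
import numpy as np

def histogram_sketch(data, bins):
    """Count samples per bin with the sort / searchsorted / diff idea."""
    sorted_data = np.sort(np.asarray(data))
    edges = np.asarray(bins)
    # Position of the first sample >= each edge (bins are half-open on the right).
    idx = np.searchsorted(sorted_data, edges, side="left")
    # Assumption: make the last edge inclusive so the final bin is closed.
    idx[-1] = np.searchsorted(sorted_data, edges[-1], side="right")
    # Differences between consecutive positions are the per-bin counts.
    return np.diff(idx)

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
edges = np.array([-3.0, -1.0, 0.0, 0.5, 3.0])  # deliberately unequal widths

counts = histogram_sketch(data, edges)
ref_counts, _ = np.histogram(data, bins=edges)
print(counts, ref_counts)
```

Samples outside the outermost edges are simply not counted, which is also
what np.histogram does when given explicit bin edges.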

I just added a warning alerting concerned users (r8674), so this takes care
of the bug fix and Nils' wish to avoid a silent change in behavior. These two
changes could be included in 1.5 if Ralf feels this is worthwhile.
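
For reference, the density normalization discussed in this thread can be
sketched from the returned counts and edges alone. The helper name
density_from_counts is hypothetical and the sample data are made up; the idea
is to divide each count by the total count times its own bin width, so that
unequal bins still integrate to one.

```python
import numpy as np

def density_from_counts(counts, edges):
    """Turn raw bin counts into a density that integrates to 1."""
    counts = np.asarray(counts, dtype=float)
    widths = np.diff(np.asarray(edges, dtype=float))
    # Each bin's height is its share of the samples divided by its own width.
    return counts / (counts.sum() * widths)

data = np.array([0.1, 0.2, 0.4, 0.7, 1.5, 2.5, 3.5, 3.9])
edges = np.array([0.0, 1.0, 4.0])  # one narrow bin, one wide bin

counts, _ = np.histogram(data, bins=edges)
density = density_from_counts(counts, edges)

# Area under the density: sum of height * width over all bins.
area = np.sum(density * np.diff(edges))
print(density, area)
```

Note that this sketch does not guard against an all-zero histogram, where the
division would be undefined.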

Cheers,

David H.



> Josef
>
>
> >
> > Ben Root
> >