[Numpy-discussion] numpy histogram normed=True (bug / confusing behavior)

josef.pktd@gmai... josef.pktd@gmai...
Mon Aug 30 14:02:55 CDT 2010


On Mon, Aug 30, 2010 at 2:43 PM, Benjamin Root <ben.root@ou.edu> wrote:
> On Mon, Aug 30, 2010 at 10:50 AM, <josef.pktd@gmail.com> wrote:
>>
>> On Mon, Aug 30, 2010 at 11:39 AM, Bruce Southey <bsouthey@gmail.com>
>> wrote:
>> > On 08/30/2010 09:19 AM, Benjamin Root wrote:
>> >
>> > On Mon, Aug 30, 2010 at 8:29 AM, David Huard <david.huard@gmail.com>
>> > wrote:
>> >>
>> >> Thanks for the feedback,
>> >> As far as I understand it, the proposal is to keep histogram as it is
>> >> for 1.5, then in 2.0 deprecate normed=True but keep the buggy behavior,
>> >> while adding a density keyword that fixes the bug. In a later release,
>> >> we could then get rid of normed. While the bug won't be present in
>> >> histogramdd and histogram2d, the keyword change should be mirrored in
>> >> those functions as well.
>> >> I personally am not too keen on renaming the keyword normed to density.
>> >> I feel we are trading clarity for a few new users against additional
>> >> trouble for many existing users. We could mitigate this by first
>> >> documenting the change in the docstring and living with both keywords
>> >> for a few years before raising a DeprecationWarning.
>> >> Since this has a direct impact on matplotlib's hist, I'd be keen to
>> >> hear from the devs on this.
>> >> David
>> >
>> > I am not a dev, but I would like to give a word of warning from
>> > matplotlib.
>> >
>> > In matplotlib, the bar/hist family of functions grew organically as the
>> > devs took on various requests to add keywords and such to modify the
>> > style and behavior of those graphing functions.  It has now become an
>> > unmaintainable mess, prompting discussions on how to rip it out and
>> > replace it with a cleaner implementation.  While everyone agrees that it
>> > needs to be done, none of us wants to break backwards compatibility.
>> >
>> > My personal feeling is that a function should do one thing, and do that
>> > one thing well.  So, to me, that means that histogram() should return an
>> > array of counts and the bins for those counts.  Anything more is merely
>> > window dressing to me.  With this information, one can easily compute a
>> > cumulative distribution function, and/or normalize the result.  The idea
>> > is that if there is nothing special that needs to be done within the
>> > histogram algorithm to accommodate these extra features, then they
>> > belong outside the function.
>> >
>> > My 2 cents,
>> > Ben Root
>> >
>> > _______________________________________________
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion@scipy.org
>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >
>> > +1 for Ben's approach.
>> > This is very similar to my view regarding the contingency table class
>> > proposed for scipy (http://projects.scipy.org/scipy/ticket/1258). We
>> > need to provide the core functionality that other approaches, such as
>> > density estimation, can use without being tied to specific details.
>>
>> I think (a corrected) density histogram is core functionality for
>> unequal bin lengths.
>>
>> The graph with raw count in the case of unequal bin sizes would be
>> quite misleading when plotted and interpreted on the real line and not
>> on discrete points (shaded areas instead of vertical lines). And as
>> the origin of this thread showed, it's not trivial to figure out what
>> the correct normalization is.
>> So, I think, if we drop the density normalization, we just need a new
>> function that does it.
>>
>> My 2c,
>>
>> Josef
>>
>>
>
> Why not a function that takes the output of a core histogram and produces a
> correct density normalization?  Such a function would be useful elsewhere, I
> imagine.
>
> Of course there are a lot of legacy issues to consider, but if we introduce
> such a function first, with documentation in histogram() showing how to
> produce a normalized density, we can then keep some of the bad code for now
> for backwards compatibility, with notes saying that some of it will be
> deprecated.  The docs should especially point out where the current code
> fails to produce the correct results.

Bugfix or redesign?

My feature request for the histogram functions (or my goal in forking
them) is to get the intermediate results out, or to get additional
results, for example the bin number or quantization for each
observation, or some other things that I don't remember right now.
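
As a rough illustration of the bin number per observation, here is a
minimal sketch that recovers it outside of histogram(). It assumes that
searchsorted on the returned edges is an acceptable stand-in; it is not
what histogram() does internally, and x and edges are made-up data.

```python
import numpy as np

# Made-up observations and unequal bin edges (assumptions for illustration).
x = np.array([0.1, 0.2, 0.3, 0.4, 1.5, 2.5, 3.5, 7.0, 8.0])
edges = np.array([0.0, 1.0, 2.0, 4.0, 8.0])

counts, _ = np.histogram(x, bins=edges)

# Bins are half-open [left, right), so the bin number of each observation
# is the position of the last edge that is <= the observation.
idx = np.searchsorted(edges, x, side="right") - 1

# np.histogram closes the last bin on the right, so values equal to the
# final edge belong to the last bin rather than falling outside.
idx[x == edges[-1]] = len(edges) - 2

# Observations outside the edges get an index of -1 or len(edges) - 1;
# mask them out before counting.
inside = (idx >= 0) & (idx < len(edges) - 1)
recount = np.bincount(idx[inside], minlength=len(edges) - 1)
```

Counting the indices again with bincount should give back the same
counts as histogram(), which is a cheap consistency check.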

With histogram functions that only return the histogram, we lose a lot
of the intermediate calculations. This is, however, not really relevant
for calculating densities, since the bin edges are returned.
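
To make that concrete, here is a minimal sketch of a density
normalization computed only from what histogram() already returns.
density_from_counts is a hypothetical helper name, not an existing numpy
function, and the data and unequal bin edges are made up.

```python
import numpy as np

def density_from_counts(counts, edges):
    """Hypothetical helper: scale raw counts so the histogram integrates to 1.

    Each count is divided by the total number of counted observations and
    by the width of its own bin. Assumes at least one observation was counted.
    """
    counts = np.asarray(counts, dtype=float)
    widths = np.diff(edges)
    return counts / (counts.sum() * widths)

# Made-up data with deliberately unequal bin widths.
x = np.array([0.1, 0.2, 0.3, 0.4, 1.5, 2.5, 3.5, 7.0])
edges = np.array([0.0, 1.0, 2.0, 4.0, 8.0])

counts, edges = np.histogram(x, bins=edges)
dens = density_from_counts(counts, edges)

# The area under the density (height times width, summed) should be 1.
area = np.sum(dens * np.diff(edges))

# The cumulative distribution at the right edge of each bin also follows
# from the raw counts alone.
cdf = np.cumsum(counts) / counts.sum()
```

With unequal widths the raw counts and the density can rank the bins
differently, which is why the per-bin width has to enter the
normalization.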

Josef


>
> Ben Root

