histogram complete makeover

Erin Sheldon erin.sheldon at gmail.com
Wed Oct 18 13:00:20 CDT 2006


On 10/17/06, David Huard <david.huard at gmail.com> wrote:
> Hi all,
>
> I'd like to poll the list to see what people want from numpy.histogram(),
> since I'm currently writing a contender.
>
> My main complaints with the current version are:
> 1. upper outliers are stored in the last bin, while lower outliers are not
> counted at all,
> 2. cannot use weights.
>
> The new histogram function is well under way (it addresses these issues and
> adds an axis keyword), but I want to know what the preferred behavior is
> regarding the function output, and your willingness to introduce a new
> behavior that will break some code.
>
> Given a number of bins N and range (min, max), histogram constructs linearly
> spaced bin edges
> b0 (out-of-range)  | b1 | b2 | b3 | .... | bN | bN+1 out-of-range
> and may return:
>
> A.  H = array([N_b0, N_b1, ..., N_bN,  N_bN+1])
> The out-of-range values are the first and last values of the array. The
> returned array hence has length N+2.
>
> B.  H = array([N_b0 + N_b1, N_b2, ..., N_bN + N_bN+1])
> The lower and upper out-of-range values are added to the first and last bin
> respectively.
>
> C.  H = array([N_b1, ..., N_bN + N_bN+1])
> Current behavior: the upper out-of-range values are added to the last bin.
>
> D.  H = array([N_b1, N_b2, ..., N_bN]),
> Lower and upper out-of-range values are given after the histogram array.
>
> Ideally, the new function would not break the common usage: H =
> histogram(x)[0], so this excludes A.  B and C are not acceptable in my
> opinion, so only D remains, with the downside that the outliers are not
> returned. A solution might be to add a keyword full_output=False, which when
> set to True, returns the out-of-range values in a dictionary.
>
> Also, the current function returns -> H, ledges
>  where ledges is the array of left bin edges (N).
>  I propose returning the complete array of edges (N+1), including the
> rightmost edge. This is a little bit impractical for plotting, as the edges
> array does not have the same length as the histogram array, but allows the
> use of user-defined non-uniform bins.
>
> Opinions, suggestions ?
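
To make options A through D concrete, here is a rough sketch of the four output layouts. The helper name histogram_layouts and the "lower"/"upper" dictionary keys are made up for illustration and are not part of numpy; the sketch also assumes 1-D input and that a value exactly equal to max belongs to the last bin.

```python
import numpy as np

def histogram_layouts(x, nbins, lo, hi):
    """Hypothetical helper: build the candidate outputs A-D for 1-D data x."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(lo, hi, nbins + 1)  # N+1 linearly spaced edges
    # Slot 0 = below range, slots 1..N = regular bins, slot N+1 = above range.
    idx = np.searchsorted(edges, x, side="right")
    # Assumption: the last bin is closed on the right, so x == hi is in range.
    idx[x == hi] = nbins
    full = np.bincount(idx, minlength=nbins + 2)
    inner = full[1:-1]

    a = full.copy()                       # A: outliers as first/last entries
    b = inner.copy()                      # B: outliers folded into end bins
    b[0] += full[0]
    b[-1] += full[-1]
    c = inner.copy()                      # C: only upper outliers folded in
    c[-1] += full[-1]
    # D: in-range counts, with outliers reported separately.
    d = (inner.copy(), {"lower": int(full[0]), "upper": int(full[-1])})
    return {"A": a, "B": b, "C": c, "D": d}
```

Only D keeps the length-N array that H = histogram(x)[0] callers expect while still making the outlier counts available.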

I dislike the current behavior.  I don't want the histogram
to count anything outside the range I specify.

It would also be nice to allow specification of a binsize,
which would be used if the number of bins wasn't given.
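
A minimal sketch of the binsize idea, assuming the edges are built by hand and then passed to histogram as explicit bins. The helper edges_from_binsize is hypothetical; when the range is not a whole multiple of binsize, this version lets the last edge extend past max rather than shrinking the final bin.

```python
import numpy as np

def edges_from_binsize(lo, hi, binsize):
    """Hypothetical helper: bin edges of width binsize covering [lo, hi]."""
    if binsize <= 0:
        raise ValueError("binsize must be positive")
    # Round up so the bins always cover the whole requested range.
    nbins = int(np.ceil((hi - lo) / binsize))
    return lo + binsize * np.arange(nbins + 1)

edges = edges_from_binsize(0.0, 3.0, 0.5)
# Passing an explicit edge array; the value 5.0 lies outside the edges.
counts, edges_out = np.histogram([0.1, 0.6, 2.9, 5.0], bins=edges)
```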

Personally, since I don't have any code yet that uses
histogram, I feel like edges could be returned in a
keyword.  Perhaps in a dictionary with other useful items, such
as bin middles, mean of the data in bins and other statistics, or
whatever, which would only be calculated if the keyword
dict was sent.
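
Something along these lines is what I have in mind. This is only a sketch: the function name histogram_with_stats, the keyword more, and the dictionary keys are invented here, and the per-bin mean is computed by histogramming the data a second time with the data values as weights.

```python
import numpy as np

def histogram_with_stats(x, nbins, lo, hi, more=None):
    """Hypothetical sketch: return counts; fill `more` only if a dict is passed."""
    x = np.asarray(x, dtype=float)
    # Drop anything outside [lo, hi] so outliers are never counted.
    inside = x[(x >= lo) & (x <= hi)]
    counts, edges = np.histogram(inside, bins=nbins, range=(lo, hi))

    if more is not None:
        # Extra statistics are computed only when the caller asks for them.
        more["edges"] = edges
        more["centers"] = 0.5 * (edges[:-1] + edges[1:])
        sums, _ = np.histogram(inside, bins=nbins, range=(lo, hi),
                               weights=inside)
        mean = np.full(nbins, np.nan)     # empty bins stay NaN
        nonempty = counts > 0
        mean[nonempty] = sums[nonempty] / counts[nonempty]
        more["mean"] = mean
    return counts

extras = {}
h = histogram_with_stats([0.5, 0.7, 2.5, 9.0], 3, 0.0, 3.0, more=extras)
```

Callers who only want the counts pay nothing extra, and the return value stays a plain array.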

Hopefully Google and sourceforge are playing nice and
you will see this within a day of sending.
Erin





More information about the Numpy-discussion mailing list