# [SciPy-user] wassup with histogram?

danny shevitz danny_shevitz at yahoo.com
Wed Apr 28 15:25:53 CDT 2004

```So I tried to use stats.histogram and I believe I found a bug, but I
also have a bigger wtf with the algorithm.

If I create a random vector in [0,1), I get uneven counts. Here it is:

####################################################
import scipy
from scipy.stats import histogram
from random import random

n = 100
randArray = [random() for i in range(n)]
# the call itself was omitted from the post; default arguments assumed
histogram(randArray)
####################################################

gives

(array([15, 17, 22, 20, 20,  6,  0,  0,  0,  0]),
-0.094840236656285076, 0.20868271373704711, 0)

Notice that the last entries in the histogram are all zeros. This is
always the case. Also notice that the bin width is about 0.2, which is
roughly double what it should be. I traced the error to this code in
the stats.py module:

estbinwidth = float(Max - Min)/float(numbins) + 1
binsize = (Max-Min+estbinwidth)/float(numbins)

computes the bin size incorrectly. In particular the +1 in estbinwidth
needs to be in parentheses. You would only notice this for small data
ranges, where the constant 1 swamps (Max-Min)/numbins, which is perhaps
why it was never noticed. Personally, I would just compute the binsize
in one step as

binsize = (numbins+1)*(Max-Min)/float(numbins**2)

But now for the wtf part. The histogram centers the lowest and highest
bins around the lowest and highest point in the data as witnessed in
the code

lowerreallimit = Min - binsize/2.0

and the fact that the bin size isn't just
(max - min)/n.

Why would you possibly want to do this? With this technique, if you
histogram a sample of uniform random variates, then the outer two bins
will have half the counts of the other bins because only half the bin
is within range.

There must be a reason for doing this, but I sure don't know what it
is.

D

```
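
For anyone who wants to reproduce the two effects described in the post without the original stats.py, here is a minimal sketch in plain Python. The function names are hypothetical. The formulas for estbinwidth, binsize and lowerreallimit are the ones quoted above, and the "no offset" variant assumes the +1 is simply dropped, which is only one reading of the proposed fix.

```python
import random


def binsize_quoted(lo, hi, numbins):
    """Bin size exactly as in the two quoted stats.py lines."""
    estbinwidth = float(hi - lo) / float(numbins) + 1
    return (hi - lo + estbinwidth) / float(numbins)


def binsize_no_offset(lo, hi, numbins):
    """Assumed variant: the same formula with the '+ 1' dropped."""
    estbinwidth = float(hi - lo) / float(numbins)
    return (hi - lo + estbinwidth) / float(numbins)


def centered_histogram(data, numbins, binsize):
    """Count data into numbins bins, with the first bin centered on min(data)."""
    lowerreallimit = min(data) - binsize / 2.0
    counts = [0] * numbins
    for x in data:
        index = int((x - lowerreallimit) / binsize)
        if 0 <= index < numbins:
            counts[index] += 1
    return counts


random.seed(0)
data = [random.random() for _ in range(100000)]
lo, hi = min(data), max(data)

wide = binsize_quoted(lo, hi, 10)       # about 0.21 for data in [0,1)
narrow = binsize_no_offset(lo, hi, 10)  # about 0.11 for data in [0,1)

counts_wide = centered_histogram(data, 10, wide)
counts_narrow = centered_histogram(data, 10, narrow)

print(wide, counts_wide)
print(narrow, counts_narrow)
```

With the quoted formula the data fits into the first six bins and the remaining four stay empty, as in the output in the post. With the offset dropped all ten bins are used, but the first and last bins still collect only about half as many points as the interior ones, because roughly half of each lies outside the data range.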