[Numpy-discussion] histogram: sum up values in each bin
Thu Aug 27 22:37:02 CDT 2009
On Thu, Aug 27, 2009 at 1:27 PM, <email@example.com> wrote:
> On Thu, Aug 27, 2009 at 12:49 PM, Tim
> Michelsen<firstname.lastname@example.org> wrote:
>>> Tim, do you mean, that you want to apply other functions, e.g. mean or
>>> variance, to the original values but calculated per bin?
>> Sorry that I forgot to add this. Shame.
>> I would like to apply these mathematical functions on the original values
>> stacked in the respective bins.
>> For instance:
>> The sample data measures the weight of an animal.
>> 1) histogram gives a count of how many values are in each bin.
>> I would like to calculate the average weight of all animals
>> sorted in bin1, bin2 etc.
>> This is also useful where you have a time component.
>> In spreadsheets I would use a '=' to reference the original data and then
>> either sum it up or count it per class.
>> I hope this is somehow understandable.
> Yes, it is a quite common use case for descriptive statistics, and I'm
> starting to collect different ways of doing it.
> In your case, Vincent's way is the easiest.
> If you need to be faster, or you want to apply the same classification
> also to other variables, e.g. size of the animal, ..., then creating a
> label array would be a more flexible solution.
> There was a similar thread recently on the scipy-user list for sorted
> arrays: "How to average different pieces or an array?"
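The label-array idea mentioned above can be sketched roughly as follows. This is only an illustration, not the code from either thread: the names weight and size, the random data and the choice of 5 bins are made up, and np.digitize is used here as one possible way to build the labels.

```python
import numpy as np

rng = np.random.RandomState(0)
# hypothetical data: animal weights and a second variable measured on the same animals
weight = rng.normal(loc=50.0, scale=10.0, size=200)
size = 0.02 * weight + rng.normal(scale=0.1, size=200)

counts, edges = np.histogram(weight, bins=5)
nbins = len(counts)

# Label every observation with its bin number 0..nbins-1.
# Passing only the interior edges puts the maximum into the last bin,
# which matches the closed right-most bin of np.histogram.
labels = np.digitize(weight, edges[1:-1])

# The same labels classify any other variable of the same length.
label_counts = np.bincount(labels, minlength=nbins)
mean_weight = np.bincount(labels, weights=weight, minlength=nbins) / label_counts
mean_size = np.bincount(labels, weights=size, minlength=nbins) / label_counts

print(label_counts)
print(mean_weight)
print(mean_size)
```

Once the labels exist, the classification by weight is computed only once and can be reused for sums, means or counts of any other column.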
Here is a version where bincount and histogram produce the same
results for mean and variance per bin if no bins are empty. If a bin
is empty then either some nans or some small arbitrary numbers are returned.
# incompletely tested if a bin has zero elements, nans or missing in variance
import numpy as np
x = np.random.normal(size=100) #+ 1e5 # + 1e8 to compare precision
c, b = np.histogram(x)
sortind = np.argsort(x)
reverse_sortind = np.argsort(sortind)
xsorted = x[sortind]
bind = np.searchsorted(xsorted,b,'right')
#construct label index
ind2 = np.zeros(x.shape, int)
ind2[bind[1:-1]] = 1 # assumes boundary indices are included in y
ind = ind2.cumsum()
labels = ind[reverse_sortind] # reverse sorting
means = np.bincount(ind,xsorted)*1.0/np.bincount(ind)
count = np.bincount(labels)
means = np.bincount(labels,x)*1.0/count
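#compare mean with histogram
# np.histogram returns (values, bin_edges); keep only the per-bin values
countsPerBin = np.histogram(x)[0]
sumsPerBin = np.histogram(x, weights=x)[0]
averagePerBin = sumsPerBin / countsPerBin
# two-pass variance with bincount: subtract each bin's mean before squaring
meanarr = means[labels]
var = np.bincount(labels,(x-meanarr)**2)/count
# with histogram: one-pass variance, E[x**2] - E[x]**2 per bin
squaresums_perbin = np.histogram(x, weights=x**2)[0]
var_perbin = squaresums_perbin*1.0 / countsPerBin - averagePerBin**2
# difference should be close to zero when no bin is empty
print(np.array(var) - np.array(var_perbin))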
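
The empty-bin case and the precision comparison hinted at by the "+ 1e8" comment can be explored with a small sketch like the one below. It is an assumption-laden illustration rather than a tested replacement for the code above: binned_mean_var is a hypothetical helper, the data and bin edges are made up so that the middle bin is empty, and np.digitize is swapped in for the argsort/searchsorted label construction.

```python
import numpy as np

def binned_mean_var(x, edges):
    """Per-bin count, mean and population variance; empty bins give nan.

    Hypothetical helper. Values outside [edges[0], edges[-1]] are not
    dropped but end up in the first or last bin.
    """
    x = np.asarray(x, dtype=float)
    nbins = len(edges) - 1
    labels = np.digitize(x, edges[1:-1])
    count = np.bincount(labels, minlength=nbins)
    with np.errstate(divide="ignore", invalid="ignore"):
        mean = np.bincount(labels, weights=x, minlength=nbins) / count
        # two-pass: subtract each bin's mean before squaring
        dev2 = (x - mean[labels]) ** 2
        var = np.bincount(labels, weights=dev2, minlength=nbins) / count
    return count, mean, var

x = np.array([0.1, 0.2, 0.3, 2.5, 2.7])   # made-up data, nothing falls in [1, 2)
edges = np.array([0.0, 1.0, 2.0, 3.0])
count, mean, var = binned_mean_var(x, edges)
print(count, mean, var)

# Shift the data by 1e8 to compare the two variance formulas.
shifted = x + 1e8
_, _, var_two_pass = binned_mean_var(shifted, edges + 1e8)
with np.errstate(divide="ignore", invalid="ignore"):
    sq = np.histogram(shifted, bins=edges + 1e8, weights=shifted ** 2)[0]
    s = np.histogram(shifted, bins=edges + 1e8, weights=shifted)[0]
    var_one_pass = sq / count - (s / count) ** 2   # E[x**2] - E[x]**2
print(var_two_pass)
print(var_one_pass)
```

Passing minlength keeps trailing empty bins in the output, and the empty bin shows up as an explicit nan. With the large offset, the one-pass formula subtracts two numbers of order 1e16, so its result can be dominated by rounding error, while the two-pass version should stay close to the unshifted variance.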