[Numpy-discussion] calculating weighted majority using two 3D arrays
Damian Eads
eads@soe.ucsc....
Thu Mar 6 21:54:05 CST 2008
Hi Gregory,
Gregory, Matthew wrote:
> Eads, Damian wrote:
>> You may need to be a bit more specific by what you mean by
>> weighted majority. What are the range of values for values
>> and weights, specifically? This sounds a lot like pixel
>> classification where each pixel is classified with a majority
>> vote over its weights and values. Is that what you're trying to do?
>>
>> Many numpy functions (e.g. mean, max, min, sum) have an axis
>> parameter, which specifies the axis along which the statistic
>> is computed. Omitting the axis parameter causes the statistic
>> to be computed over all values in the multidimensional array.
>>
>> Suppose the 'values' array contains floating point numbers in
>> the range
>> -1 to 1 and a larger absolute value gives a larger
>> confidence. Also suppose the weights are floating point
>> numbers between 0 and 1. The weighted majority vote for pixel
>> i,j over 10 real-valued (confidenced) votes, each vote having
>> a separate weight, is computed by
>>
>> w_vote = numpy.sign((values[:,i,j]*weights[:,i,j]).sum())
>>
>> This can be vectorized to give a weighted majority vote for
>> each pixel by doing
>>
>> w_vote = numpy.sign((values*weights).sum(axis=0))
>>
>> The values*weights expression gives a weighted prediction.
>> This also works if the 'values' are just predictions from the
>> set {-1, 1}, i.e.
>> there are ten classifiers, each one predicts either -1 and 1
>> on each pixel.
>
> Damian, thank you for the helpful response. I should have been a bit
> more explicit about what I meant by weighted majority. In my case, I
> need to find a discrete value (i.e. class) that occurs most often among
> ten observations where weighting is pre-determined by an
> inverse-distance calculation. Ignoring for a moment the
> multidimensionality issue, my values and weights arrays might look like
> this:
>
> values = array([14, 32, 12, 50, 2, 8, 19, 12, 19, 10])
> weights = array([0.5, 0.1, 0.6, 0.1, 0.8, 0.3, 0.8, 0.4, 0.9, 0.2])
>
> My function to calculate the majority looks like this:
>
> def weightedMajority(a, b):
>
> # Put all the samples into a dictionary with weights summed for
> # duplicate values
> wDict = {}
> for i in xrange(len(a)):
> (value, weight) = (a[i], b[i])
>
> if wDict.has_key(value):
> wDict[value] += weight
> else:
> wDict[value] = weight
>
> # Create arrays of the values and weights
> values = numpy.array(wDict.keys())
> weights = numpy.array(wDict.values())
>
> # Return the index of the maximum value
> index = numpy.argmax(weights)
>
> # Return the majority value
> return values[index]
Hi Matthew,
Keep in mind that 'for' loops are inefficient in python. This is less
worrisome when the input data sets are small. However, for larger data
sets, one must exercise a bit more care when using Python 'for' loops.
There is a lot of overhead for each iteration.
I would advise looping over the class labels, rather than the examples
since the number of class labels is in most cases significantly fewer
than the number of examples.
def weighted_majority(values, weights):
# The number of different kinds of values.
kinds = numpy.unique(values)
# The weight sums of the values.
weight_sums = numpy.zeros((len(kinds),))
# Loop over each different kind of value.
for i in xrange(0, len(kinds)):
# Grab the i'th kind of value
kind = kinds[i]
# Create a mask for the values of that kind.
kind_mask = values == kind
# Sum up the weights corresponding to the masked values.
weight_sums[i] += weights[kind_mask].sum()
#end for
# Return the kind label with the largest weight sum.
return kinds[weight_sums.argmax()]
The code above should also generalize to multidimensional arrays since
the kind_mask matches the dimensionality of both the 'values' and
'weights' variables. A caveat: I have not extensively tested this code
but it looks correct.
>
> In the above example:
>
>>> maj = weightedMajority(values, weights)
>>> maj
> 19
>
> Correct me if I'm wrong, but I don't think that your example will work
> when I am looking to return a discrete value from the values set, but
> you may see something that I'm doing that is truly inefficient!
If your predictions come from a set of nominal values (or class labels)
where order has no meaning among the class labels, and there are more
than two kinds of labels (or prediction values) then you are correct, my
example from the earlier post will not work. It only works for binary
prediction values with or without confidence ratings.
Damian
More information about the Numpy-discussion
mailing list