# [Numpy-discussion] weighted mean; weighted standard error of the mean (sem)

josef.pktd@gmai... josef.pktd@gmai...
Fri Sep 10 14:01:56 CDT 2010

```
On Fri, Sep 10, 2010 at 1:58 PM, Christopher Barrington-Leigh
<cpblpublic+numpy@gmail.com> wrote:
> Interesting. Thanks Erin, Josef and Keith.

Thanks to the Stata page, at least I figured out that WLS corresponds to
aweights with the assumption mu_i = mu.

import numpy as np
from scikits.statsmodels import WLS
w0 = np.arange(20) % 4
w = 1.*w0/w0.sum()
y = 2 + np.random.randn(20)
nobs = len(y)  # number of observations, used in s2u below

>>> res = WLS(y, np.ones(20), weights=w).fit()
>>> print res.params, res.bse
[ 2.29083069] [ 0.17562867]
>>> m = np.dot(w, y)
>>> m
2.2908306865128401
>>> s2u = 1/(nobs-1.) * np.dot(w, (y - m)**2)
>>> s2u
0.030845429945278956
>>> np.sqrt(s2u)
0.17562867062435722

>
> There is a nice article on this at
> http://www.stata.com/support/faqs/stat/supweight.html. In my case, the
> model I've in mind is to assume that the expected value (mean) is the same
> for each sample, and that the weights are/should be normalised, whence a
> consistent estimator for sem is straightforward (if second moments can
> be assumed to be well behaved?). I suspect that this (survey-like) case
> is also one of the two most standard/most common expressions that people
> want when they ask for an s.e. of the mean for
> a weighted dataset. The other would be when the weights are not to be
> normalised, but represent standard errors on the individual
> measurements.
>
> Surely what one wants, in the end, is a single function (or whatever)
> called mean or sem which calculates different values for different
> specified choices of model (assumptions)? And where possible that it has a
> default model in mind for when none is specified?

I still find aweights and pweights confusing, plus the necessary auxiliary
assumptions.

I don't find the Stata docs very helpful; I almost never find a clear
description of the formulas (and I don't have any Stata books).

If you have or can write some examples that cover the different
cases, that would be very helpful for getting some structure into this
area: weighting and survey sampling, and population versus clustered
or stratified sample statistics.

I'm still pretty lost with the literature on surveys.

Josef

>
> thanks,
> Chris
>
> On Thu, Sep 9, 2010 at 9:13 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
>> >>>> ma.std()
>> >>   3.2548815339711115
>> >
>> > or maybe `w` reflects an underlying sampling scheme and you should
>> > sample in the bootstrap according to w ?
>>
>> Yes....
>>
>> > if weighted average is a sum of linear functions of (normal)
>> > distributed random variables, it still depends on whether the
>> > individual observations have the same or different variances, e.g.
>> > http://en.wikipedia.org/wiki/Weighted_mean#Statistical_properties
>>
>> ...lots of possibilities. As you have shown the problem is not yet
>> well defined. Not much specification needed for the weighted mean,
>> lots needed for the standard error of the weighted mean.
>>
>> > What I can't figure out is whether, if you assume sigma_i = sigma for
>> > all observations i, we use the weighted or the unweighted variance
>> > to get an estimate of sigma. And I'm not able to replicate with simple
>> > calculations what statsmodels.WLS gives me.
>>
>> My guess: if all you want is sigma of the individual i and you know
>> sigma is the same for all i, then I suppose you don't care about the
>> weight.
>>
>> >
>> > ???
>> >
>> > Josef
```
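
The two cases discussed in the thread can be sketched with NumPy alone. This is a minimal illustration, not statsmodels code: the function names `weighted_mean_sem` and `inverse_variance_mean_sem` are hypothetical. The first follows the calculation in the post (common mean, weights normalised to sum to 1, divisor `nobs - 1`). The second assumes the other reading Chris mentions, where each observation comes with its own known standard error `sigma_i`, and uses the usual inverse-variance formula.

```python
import numpy as np


def weighted_mean_sem(y, w):
    """Weighted mean and its standard error, assuming mu_i = mu for all i.

    Weights are treated as relative and normalised to sum to 1, as in the
    post. Observations with zero weight still count toward nobs.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                      # normalise the weights
    nobs = y.size
    m = np.dot(w, y)                     # weighted mean
    var_m = np.dot(w, (y - m) ** 2) / (nobs - 1.0)
    return m, np.sqrt(var_m)


def inverse_variance_mean_sem(y, sigma):
    """Weighted mean and its standard error when sigma_i is known per point.

    Assumption: independent observations with known standard errors sigma_i,
    weighted by 1 / sigma_i**2.
    """
    y = np.asarray(y, dtype=float)
    precision = 1.0 / np.asarray(sigma, dtype=float) ** 2
    m = np.dot(precision, y) / precision.sum()
    return m, np.sqrt(1.0 / precision.sum())


# Same setup as in the post, with a seeded generator for reproducibility.
rng = np.random.default_rng(0)
w0 = np.arange(20) % 4
y = 2 + rng.standard_normal(20)

m, sem = weighted_mean_sem(y, w0)
print(m, sem)
```

With equal weights the first function reduces to the ordinary sample mean and `std(ddof=1) / sqrt(nobs)`, and rescaling the weights by a constant leaves both results unchanged, which is a quick way to check that the normalisation behaves as intended.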