# [Numpy-discussion] non-standard standard deviation

Colin J. Williams cjw@ncf...
Sun Dec 6 10:01:13 CST 2009

```
On 04-Dec-09 10:54 AM, Bruce Southey wrote:
> On 12/04/2009 06:18 AM, yogesh karpate wrote:
>> @ Pauli and @ Colin:
>> Sorry for the late reply. I was busy with some other assignments.
>> # As far as normalization by (n) is concerned, the common
>> assumption is that the population is normally distributed and that the
>> population size is large enough to fit the normal distribution. But this
>> standard deviation, when applied to a small population, tends to be
>> too low; therefore it is called biased.
>> # The correction known as Bessel's correction is there for the
>> small-sample std. deviation, i.e. normalization by (n-1).
>> # In "electrical-and-electronic-measurements-and-instrumentation" by
>> A.K. Sawhney, in the first chapter of the book, "Fundamentals of
>> Measurements", it is shown that for N=16 the std. deviation
>> normalization was (n-1)=15.
>> # While I was learning statistics in my course, the instructor would
>> advise taking n=20 for normalization by (n-1).
>> # Probability and Statistics in the Schaum's series is good reading.
>> Regards
>> ~ymk
>>
> Hi,
> Basically, all that I see with these arbitrary values is that you are
> relying on the 'central limit theorem'
> (http://en.wikipedia.org/wiki/Central_limit_theorem).  Really the
> issue in using these values is how much statistical bias you will
> tolerate, especially in its impact on the usage of that estimate, because
> the usage of variance (such as in statistical tests) tends to be more
> influenced by bias than the estimate of variance itself. (Of course, many
> features rely on asymptotic properties, so bias concerns are less
> apparent in large sample sizes.)
>
> Obviously the default relies on the developers' background and
> requirements. There are multiple valid variance estimators in
> statistics with different denominators like N (maximum likelihood
> estimator), N-1 (restricted maximum likelihood estimator and certain
> Bayesian estimators) and Stein's
> (http://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator). So
> the current default behavior is valid and documented. Consequently
> you cannot just have one option or different functions (like certain
> programs), and Numpy's implementation actually allows you to do all of
> these in a single function. So I also see no reason to change, even if
> I have to add the ddof=1 argument; after all, 'Explicit is better than
> implicit' :-).
>
> Bruce
Bruce,

I suggest that the Central Limit Theorem is tied in with the Law of
Large Numbers.

When one has a smallish sample size, what gives the best estimate of the
variance?  The Bessel Correction provides a rationale, based on
expectations: (http://en.wikipedia.org/wiki/Bessel%27s_correction).

It is difficult to understand the proof of Stein:
http://en.wikipedia.org/wiki/Proof_of_Stein%27s_example

The symbols used are not clearly stated.  He seems interested in a
decision rule for the calculation of the mean of a sample and claims
that his approach is better than the traditional Least Squares approach.

In most cases, the interest is likely to be in the variance, with a view
to establishing a confidence interval.

In the widely used Analysis of Variance (ANOVA), the degrees of freedom
are reduced for each mean estimated, see:
http://www.mnstate.edu/wasson/ed602lesson13.htm for the example below:

*Analysis of Variance Table*

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio   p
Between Groups             25.20                2              12.60       5.178    <.05
Within Groups              29.20               12               2.43
Total                      54.40               14

There is a sample of 15 observations, which is divided into three
groups, depending on the number of hours of therapy.
Thus, the Total degrees of freedom are 15-1 = 14, the Between Groups
3-1 = 2, and the Residual (Within Groups) 14 - 2 = 12.

Colin W.

```
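
The two points argued in the thread, the `ddof` denominator choice and the degrees-of-freedom bookkeeping in the ANOVA table, can be sketched in a few lines of NumPy. This is only an illustration, not code from any of the posters: the array `x` is a hypothetical sample made up for the example, and all variable names are assumptions. The ANOVA figures (15 observations, 3 groups, sums of squares 25.20 and 29.20) are the ones quoted in the message.

```python
import numpy as np

# Hypothetical sample, chosen only for illustration (not from the thread).
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size
ss = np.sum((x - x.mean()) ** 2)  # sum of squared deviations from the sample mean

# NumPy divides by (n - ddof); the default is ddof=0.
var_ml = np.var(x)              # ss / n, the maximum likelihood estimator
var_bessel = np.var(x, ddof=1)  # ss / (n - 1), Bessel's correction
std_ml = np.std(x)
std_bessel = np.std(x, ddof=1)

print("variance, ddof=0 and ddof=1:", var_ml, var_bessel)
print("std dev,  ddof=0 and ddof=1:", std_ml, std_bessel)

# Degrees of freedom and F ratio for the ANOVA table quoted in the message.
n_obs, n_groups = 15, 3
ss_between, ss_within = 25.20, 29.20

df_total = n_obs - 1                 # one mean estimated from all observations
df_between = n_groups - 1            # one degree of freedom lost per group mean, less one
df_within = df_total - df_between    # residual degrees of freedom

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

print("df (total, between, within):", df_total, df_between, df_within)
print("F ratio:", round(f_ratio, 3))
```

The corrected estimate is always the larger of the two, by a factor of n/(n-1), so the difference matters most for small samples and fades as n grows.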