[SciPy-user] stats review: std/var and samplestd/samplevar
Zachary Pincus
zpincus at stanford.edu
Sun Apr 2 16:39:20 CDT 2006
Hi folks -
It appears to me that the scipy.stats implementations for calculating
sample variances and population variances (and hence standard
deviations too) are somehow reversed.
Specifically, the variance of an entire population is calculated with
a denominator of the population size N. The variance of a sample from
a population is either estimated using a denominator of the sample
size n (to obtain a biased estimate) or n-1 (to obtain an unbiased
estimate). Note that saying "sample variance" does not imply the use
of the n-1 estimator, as there are cases in which the biased
estimator may legitimately be used.(*)
see e.g.:
http://en.wikipedia.org/wiki/Variance
http://en.wikipedia.org/wiki/Standard_deviation
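To make the two denominators concrete, here is a minimal pure-Python
sketch (not the scipy.stats code). The function name `variance` and its
`ddof` argument are made up for this illustration: ddof=0 divides by N,
ddof=1 divides by n-1.

```python
def variance(data, ddof=0):
    """Sum of squared deviations from the mean, divided by len(data) - ddof."""
    n = len(data)
    if n - ddof <= 0:
        raise ValueError("need more than ddof data points")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
pop_var = variance(data, ddof=0)       # denominator N
unbiased_var = variance(data, ddof=1)  # denominator n-1
```

For these eight values the sum of squared deviations is 32, so the
first call gives 32/8 = 4.0 and the second gives 32/7, about 4.57. The
gap between the two shrinks as n grows.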
However, scipy.stats.std and scipy.stats.var use N-1, while
scipy.stats.samplestd and scipy.stats.samplevar use N. This is
clearly incorrect notation any way you slice it.
I would propose to have:
(1) scipy.stats.var and scipy.stats.std -- use N as the denominator
(2) scipy.stats.samplevar and scipy.stats.samplestd -- at least use
n-1 as the denominator. Better would be to deprecate / remove them
because as above "sample variance" is ambiguous.
(3) scipy.stats.var_unbiased -- use n-1 as denominator. As per the
note below, there is no general unbiased estimator of the standard
deviation, and so there should be no scipy.stats.std_unbiased
function. (See the Wikipedia entry and also
http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc32.htm )
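A rough sketch of how proposals (1) and (3) might look, written against
Python's standard `statistics` module rather than scipy.stats. The
wrappers below are hypothetical and only mirror the names proposed
above; `statistics.pvariance` divides by N and `statistics.variance`
divides by n-1.

```python
import math
import statistics


def var(data):
    """Proposal (1): population variance, denominator N."""
    return statistics.pvariance(data)


def std(data):
    """Proposal (1): square root of the N-denominated variance."""
    return math.sqrt(var(data))


def var_unbiased(data):
    """Proposal (3): unbiased variance estimate, denominator n-1."""
    return statistics.variance(data)


# Deliberately no std_unbiased: sqrt(var_unbiased(data)) is still a
# biased estimate of the population standard deviation.
```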
I feel vaguely that the N-1 estimator is always problematic, because
if you have a small enough sample that it makes a difference, you've
got bigger problems than using N or N-1. Not that these problems are
insurmountable, but you've got to have some statistical savvy to deal
properly with them. As such, I think that the default functions (var
and std) should just return the population statistics. But reasonable
people may disagree.
Zach Pincus
Program in Biomedical Informatics and Department of Biochemistry
Stanford University School of Medicine
(*) E.g.: While it is possible to estimate the variance in an
unbiased manner, estimating the standard deviation of a population
from a sample without bias is actually impossible without assumptions
about the population. (There is a complex correction factor for
samples from normal populations discussed on the NIST page.)
Moreover, though the (N-1)-denominated estimator of the variance is
unbiased, the estimator itself has a greater spread (mean squared
error) around the true value than the N-denominated estimator, at
least for samples from normal populations. As such, using the unbiased
estimator can sap statistical power from some tests. This is why
sometimes one might use the N-denominated estimator for the sample
variance.
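The claims in this note can be checked with a small simulation. The
sketch below assumes normally distributed data with true variance 1;
the function names, the sample size n=5, the trial count and the seed
are all arbitrary choices for illustration. `c4` is the usual
normal-population correction factor, under which the expected value of
the square root of the (N-1)-denominated variance is c4(n) * sigma.

```python
import math
import random


def c4(n):
    """Normal-population factor: E[s] = c4(n) * sigma, with s**2 using n-1."""
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)


def simulate(n=5, trials=20000, seed=0):
    """Average both variance estimators over many normal samples (sigma = 1)."""
    rng = random.Random(seed)
    sum_n = sum_n1 = sum_std = mse_n = mse_n1 = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(xs) / n
        ss = sum((x - mean) ** 2 for x in xs)
        v_n, v_n1 = ss / n, ss / (n - 1)
        sum_n += v_n
        sum_n1 += v_n1
        sum_std += math.sqrt(v_n1)
        mse_n += (v_n - 1.0) ** 2    # squared error against true variance 1
        mse_n1 += (v_n1 - 1.0) ** 2
    return {
        "mean_var_n": sum_n / trials,
        "mean_var_n1": sum_n1 / trials,
        "mean_std_n1": sum_std / trials,
        "mse_n": mse_n / trials,
        "mse_n1": mse_n1 / trials,
    }


result = simulate()
```

With n=5, theory predicts an average of about 1.0 for the
(N-1)-denominated variance and about 0.8 for the N-denominated one,
while the square root of the unbiased variance averages about
c4(5) = 0.94, i.e. it underestimates sigma. The mean squared errors
should come out near 0.50 for N-1 and 0.36 for N, which is the
trade-off described above.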