[SciPy-user] information on statistical functions
Wed Dec 17 21:11:11 CST 2008
On Wed, Dec 17, 2008 at 20:32, <firstname.lastname@example.org> wrote:
> On Wed, Dec 17, 2008 at 9:03 PM, Robert Kern <email@example.com> wrote:
>> On Wed, Dec 17, 2008 at 19:53, <firstname.lastname@example.org> wrote:
>>> On Wed, Dec 17, 2008 at 7:58 PM, Tim Michelsen
>>> <email@example.com> wrote:
>>>> I observed that there are 2 standard deviation functions in the
>>>> scipy/numpy modules:
>>>> What is the difference?
>>>> There is no formula included within the docstrings.
>>>> I suppose that np.std() is for the whole population and scipy.std is
>>>> designed for a smaller sample in the population.
>>>> Is that true?
>>> difference between population (numpy) and sample (scipy.stats)
>>> variance and standard deviation is whether the the estimator is
>>> biased, i.e. 1/n, or not, i.e. 1/(n-1).
>> It's a shame that the "biased/unbiased" terminology still survives in
>> the numpy.std() docstring. It's really quite wrong.
> I find talking about biased versus unbiased estimator much clearer
> than the population - sample distinction, and degrees of freedom might
> be more descriptive but its meaning, I guess, relies on knowing about
> the (asymptotic) distribution of the estimator, which I always forget
> and have to look up.
The problem is that the "unbiased" estimate for the standard deviation
is *not* the square root of the "unbiased" estimate for the variance.
The latter is what numpy.std(x, ddof=1) calculates, not the former.
This problem arises because of a pretty narrow concept of "error" that
gets misapplied to the variance estimator. The usual "error" that gets
used is the arithmetic difference between the estimate and the true
value of the parameter (p_est - p_true). For parameters like means,
this is usually fine, but for so-called scale parameters like
variance, it's quite inappropriate. For example, the arithmetic error
between a true value of 1.0 (in whatever units) and an estimate of 2.0
is the same as that between 101.0 and 102.0. When you drop a square
root into that formula, you don't get the same answers out when you
seek the estimator that sets the bias to 0.
Rather, a much more appropriate error measure for variance would be
the log-ratio: log(p_est/p_true). That way, 1.0 and 2.0 would be the
same distance from each other as 100.0 and 200.0. Using this measure
of error to define bias, the unbiased estimate of the standard
deviation actually is the square root of the unbiased estimate of the
variance, too, thanks to the magic of logarithms. Unfortunately for
those who want to call the (n-1) version "unbiased", the unbiased
estimator (for normal distributions, at least) uses (n-2). Oops! Other
distributions have different optimal denominators: heavier-tailed
distributions tend towards (n-3), finite-support distributions tend
But of course, bias is not the only thing to be concerned about. The
bias is just the arithmetic average of the errors. If you want to
minimize the total spread of the errors sum(abs(err)), too, that's
another story. With the arithmetic error metric, the unbiased
estimator of the variance uses (n-1) while the estimator with the
smallest total error uses (n). With the log-ratio error metric, the
unbiased estimator is the same as the one that minimizes the total
error. Happy days!
I also find the population/sample distinction to be bogus, too. IIRC,
there are even some sources which switch around the meanings, too. In
any case, the docstring should have a section saying, "If you are
looking for what is called the "unbiased" or "sample" estimate of
variance, use ddof=1." Those terms are widely, if incorrectly, used,
so we should mention them. I just find it disheartening that the terms
are used without qualification.
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
More information about the SciPy-user