[SciPy-user] information on statistical functions
jh@physics.uc...
jh@physics.uc...
Thu Dec 18 10:31:54 CST 2008
Robert et al.,
I agree that the formulae implemented should go in all the stats
docstrings.
Most users of np.std() and friends will not be aware of the
distinctions you raise here, but they should be. Could you add a
warning sentence in the relevant docstrings stating that these
functions are often misapplied, that their proper application is not
always obvious, and that readers should consult appropriate references
if not already familiar with concepts xxx, yyy, and zzz (pick some)?
The sentence could point to references given either in that docstring
or (for scipy functions) in the scipy.stats docstring. The references
should include something reputable and stable online (like Wikipedia,
if it is in this case). Obviously, np docstrings should not point to
references given only in scipy docs, though they can refer to routines
in scipy or elsewhere as alternatives.
A discussion based on Robert's explanation (below) could also go on
the web site somewhere, like in Cookbook, and be referenced in the
docstrings.
I'm more of a customer for this text than one who could write it, but
if you don't have time to do it yourself, could you provide the
sentence and references? A doc volunteer can then go put it in the
relevant routine docs.
Thanks,
--jh--
Date: Wed, 17 Dec 2008 21:11:11 -0600
From: "Robert Kern" <robert.kern@gmail.com>
Subject: Re: [SciPy-user] information on statistical functions
To: "SciPy Users List" <scipy-user@scipy.org>
Message-ID:
<3d375d730812171911m4f71ff6btcf9a9cff581254a4@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Wed, Dec 17, 2008 at 20:32, <josef.pktd@gmail.com> wrote:
> On Wed, Dec 17, 2008 at 9:03 PM, Robert Kern <robert.kern@gmail.com> wrote:
>> On Wed, Dec 17, 2008 at 19:53, <josef.pktd@gmail.com> wrote:
>>> On Wed, Dec 17, 2008 at 7:58 PM, Tim Michelsen
>>> <timmichelsen@gmx-topmail.de> wrote:
>>>> Hello,
>>>> I observed that there are 2 standard deviation functions in the
>>>> scipy/numpy modules:
>>>>
>>>> Numpy:
>>>> http://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html#numpy.std
>>>>
>>>> Scipy:
>>>> http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.std.html#scipy.stats.std
>>>>
>>>> What is the difference?
>>>> There is no formula included within the docstrings.
>>>>
>>>> I suppose that np.std() is for the whole population and scipy.std is
>>>> designed for a smaller sample in the population.
>>>> Is that true?
>>>
>>> difference between population (numpy) and sample (scipy.stats)
>>> variance and standard deviation is whether the the estimator is
>>> biased, i.e. 1/n, or not, i.e. 1/(n-1).
>>
>> It's a shame that the "biased/unbiased" terminology still survives in
>> the numpy.std() docstring. It's really quite wrong.
>>
>
> I find talking about biased versus unbiased estimator much clearer
> than the population - sample distinction, and degrees of freedom might
> be more descriptive but its meaning, I guess, relies on knowing about
> the (asymptotic) distribution of the estimator, which I always forget
> and have to look up.
The problem is that the "unbiased" estimate for the standard deviation
is *not* the square root of the "unbiased" estimate for the variance.
The latter is what numpy.std(x, ddof=1) calculates, not the former.
This problem arises because of a pretty narrow concept of "error" that
gets misapplied to the variance estimator. The usual "error" that gets
used is the arithmetic difference between the estimate and the true
value of the parameter (p_est - p_true). For parameters like means,
this is usually fine, but for so-called scale parameters like
variance, it's quite inappropriate. For example, the arithmetic error
between a true value of 1.0 (in whatever units) and an estimate of 2.0
is the same as that between 101.0 and 102.0. When you drop a square
root into that formula, you don't get the same answers out when you
seek the estimator that sets the bias to 0.
Rather, a much more appropriate error measure for variance would be
the log-ratio: log(p_est/p_true). That way, 1.0 and 2.0 would be the
same distance from each other as 100.0 and 200.0. Using this measure
of error to define bias, the unbiased estimate of the standard
deviation actually is the square root of the unbiased estimate of the
variance, too, thanks to the magic of logarithms. Unfortunately for
those who want to call the (n-1) version "unbiased", the unbiased
estimator (for normal distributions, at least) uses (n-2). Oops! Other
distributions have different optimal denominators: heavier-tailed
distributions tend towards (n-3), finite-support distributions tend
towards (n-1).
But of course, bias is not the only thing to be concerned about. The
bias is just the arithmetic average of the errors. If you want to
minimize the total spread of the errors sum(abs(err)), too, that's
another story. With the arithmetic error metric, the unbiased
estimator of the variance uses (n-1) while the estimator with the
smallest total error uses (n). With the log-ratio error metric, the
unbiased estimator is the same as the one that minimizes the total
error. Happy days!
I also find the population/sample distinction to be bogus, too. IIRC,
there are even some sources which switch around the meanings, too. In
any case, the docstring should have a section saying, "If you are
looking for what is called the "unbiased" or "sample" estimate of
variance, use ddof=1." Those terms are widely, if incorrectly, used,
so we should mention them. I just find it disheartening that the terms
are used without qualification.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
More information about the SciPy-user
mailing list