[Scipy-tickets] [SciPy] #300: stats.py - Population and sample variances backwards, causing downstream errors

SciPy scipy-tickets at scipy.net
Wed Nov 1 19:41:55 CST 2006

```#300: stats.py - Population and sample variances backwards, causing downstream
errors
-------------------------+--------------------------------------------------
 Reporter:  zunzun       |        Owner:  somebody
     Type:  defect       |       Status:  closed
 Priority:  normal       |    Milestone:
Component:  scipy.stats  |      Version:
 Severity:  minor        |   Resolution:  wontfix
 Keywords:               |
-------------------------+--------------------------------------------------
Changes (by rkern):

* status:  new => closed
* priority:  high => normal
* resolution:  => wontfix
* severity:  major => minor

Old description:

> In stats.py, sample and population variances are backwards.
>
> Sample variance should be divided by (n-1) instead of n,
> and population variance by n instead of (n-1).
>
> This is causing many, many downstream effects on other
> calculations such as standard deviations, kurtosis and skew.
>
> I marked these below with <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>
>      James Phillips
>      zunzun at zunzun.com
>

> def samplevar(a, axis=0):
>     """
> Returns the sample standard deviation of the values in the passed
> array (i.e., using N).  Axis can equal None (ravel array first),
> an integer (the axis over which to operate)
> """
>     a, axis = _chk_asarray(a, axis)
>     mn = expand_dims(mean(a, axis), axis)
>     deviations = a - mn
>     n = a.shape[axis]
>     svar = ss(deviations,axis) / float(n)
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>     return svar
>

> def var(a, axis=0, bias=False):
>     """
> Returns the estimated population variance of the values in the passed
> array (i.e., N-1).  Axis can equal None (ravel array first), or an
> integer (the axis over which to operate).
> """
>     a, axis = _chk_asarray(a, axis)
>     mn = expand_dims(mean(a,axis),axis)
>     deviations = a - mn
>     n = a.shape[axis]
>     vals = ss(deviations,axis)/(n-1.0)
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>     if bias:
>         return vals * (n-1.0)/n
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>     else:
>         return vals

New description:

In stats.py, sample and population variances are backwards.

Sample variance should be divided by (n-1) instead of n,
and population variance by n instead of (n-1).

This is causing many, many downstream effects on other
calculations such as standard deviations, kurtosis and skew.

I marked these below with <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

James Phillips
zunzun at zunzun.com

{{{
#!python
def samplevar(a, axis=0):
    """
Returns the sample standard deviation of the values in the passed
array (i.e., using N).  Axis can equal None (ravel array first),
an integer (the axis over which to operate)
"""
    a, axis = _chk_asarray(a, axis)
    mn = expand_dims(mean(a, axis), axis)
    deviations = a - mn
    n = a.shape[axis]
    svar = ss(deviations,axis) / float(n)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    return svar

def var(a, axis=0, bias=False):
    """
Returns the estimated population variance of the values in the passed
array (i.e., N-1).  Axis can equal None (ravel array first), or an
integer (the axis over which to operate).
"""
    a, axis = _chk_asarray(a, axis)
    mn = expand_dims(mean(a,axis),axis)
    deviations = a - mn
    n = a.shape[axis]
    vals = ss(deviations,axis)/(n-1.0)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    if bias:
        return vals * (n-1.0)/n
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    else:
        return vals
}}}

Comment:

This is debatable. For example, !MathWorld defines
[http://mathworld.wolfram.com/SampleVariance.html sample variance] and
population variance opposite to how you define them.
[http://en.wikipedia.org/wiki/Variance Wikipedia] states that the term
"sample variance" is used to name either convention. Consequently, I am
closing this ticket as "wontfix".

The docstrings should use the terms consistently, of course. They should
also explicitly state that they are using N or N-1 rather than relying on
an ambiguous naming convention. Please submit new tickets if either of
these conditions is violated. It would be good to have a paragraph or so
in the {{{stats.py}}} docstring describing the convention. Please submit
one if you want to write one up.

Besides, the correct form is actually N-2.  :-)

--
Ticket URL: <http://projects.scipy.org/scipy/scipy/ticket/300#comment:1>
SciPy <http://www.scipy.org/>
SciPy is open-source software for mathematics, science, and engineering.
```
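
For readers following the N versus N-1 discussion in the ticket, here is a minimal sketch of the two conventions side by side. It is not the stats.py implementation: the helper name `variance` and its `ddof` argument are assumptions made for illustration, borrowing the "delta degrees of freedom" naming that NumPy's `np.var` uses, so that the divisor is stated explicitly instead of being implied by the words "sample" or "population".

```python
import numpy as np


def variance(a, axis=0, ddof=0):
    """Variance along `axis`, dividing by (N - ddof).

    Hypothetical helper for illustration only.
    ddof=0 divides by N (what samplevar does in the ticket);
    ddof=1 divides by N-1 (what var(bias=False) does in the ticket).
    """
    a = np.asarray(a, dtype=float)
    n = a.shape[axis]
    # Subtract the mean along the chosen axis, keeping shapes broadcastable.
    deviations = a - np.expand_dims(a.mean(axis=axis), axis)
    # Sum of squared deviations, then divide by the explicit divisor.
    return (deviations ** 2).sum(axis=axis) / (n - ddof)


data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = data.shape[0]

var_n = variance(data, ddof=0)    # divisor N
var_n1 = variance(data, ddof=1)   # divisor N-1

# The bias=True branch quoted in the ticket rescales the N-1 result
# by (N-1)/N, which recovers the divide-by-N value.
rescaled = var_n1 * (n - 1.0) / n

print(var_n, var_n1, rescaled)
```

Naming the divisor in the signature is one way to follow the comment's suggestion that docstrings state N or N-1 outright; whether that fits stats.py is a separate design question.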