[Scipy-tickets] [SciPy] #300: stats.py - Population and sample variances backwards, causing downstream errors

SciPy scipy-tickets at scipy.net
Wed Nov 1 19:41:55 CST 2006


#300: stats.py - Population and sample variances backwards, causing downstream
errors
-------------------------+--------------------------------------------------
 Reporter:  zunzun       |        Owner:  somebody
     Type:  defect       |       Status:  closed  
 Priority:  normal       |    Milestone:          
Component:  scipy.stats  |      Version:          
 Severity:  minor        |   Resolution:  wontfix 
 Keywords:               |  
-------------------------+--------------------------------------------------
Changes (by rkern):

  * status:  new => closed
  * priority:  high => normal
  * resolution:  => wontfix
  * severity:  major => minor

Old description:

> In stats.py, sample and population variances are backwards.
>
> Sample variation should be divided by (n-1) instead of n,
> and population variance by n instead of (n-1).
>
> This is causing many, many downstream effects on other
> calculations such as standard deviations, kurtosis and skew.
>
> I marked these below with <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>
>      James Phillips
>      zunzun at zunzun.com
>

> def samplevar(a, axis=0):
>     """
> Returns the sample standard deviation of the values in the passed
> array (i.e., using N).  Axis can equal None (ravel array first),
> an integer (the axis over which to operate)
> """
>     a, axis = _chk_asarray(a, axis)
>     mn = expand_dims(mean(a, axis), axis)
>     deviations = a - mn
>     n = a.shape[axis]
>     svar = ss(deviations,axis) / float(n)
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>     return svar
>

> def var(a, axis=0, bias=False):
>     """
> Returns the estimated population variance of the values in the passed
> array (i.e., N-1).  Axis can equal None (ravel array first), or an
> integer (the axis over which to operate).
> """
>     a, axis = _chk_asarray(a, axis)
>     mn = expand_dims(mean(a,axis),axis)
>     deviations = a - mn
>     n = a.shape[axis]
>     vals = ss(deviations,axis)/(n-1.0)
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>     if bias:
>         return vals * (n-1.0)/n
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>     else:
>         return vals

New description:

 In stats.py, sample and population variances are backwards.

 Sample variance should be divided by (n-1) instead of n,
 and population variance by n instead of (n-1).

 This is causing many, many downstream effects on other
 calculations such as standard deviations, kurtosis and skew.

 I marked these below with <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

      James Phillips
      zunzun at zunzun.com

 {{{
 #!python
 def samplevar(a, axis=0):
     """
 Returns the sample standard deviation of the values in the passed
 array (i.e., using N).  Axis can equal None (ravel array first),
 an integer (the axis over which to operate)
 """
     a, axis = _chk_asarray(a, axis)
     mn = expand_dims(mean(a, axis), axis)
     deviations = a - mn
     n = a.shape[axis]
     svar = ss(deviations,axis) / float(n)  # <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
     return svar


 def var(a, axis=0, bias=False):
     """
 Returns the estimated population variance of the values in the passed
 array (i.e., N-1).  Axis can equal None (ravel array first), or an
 integer (the axis over which to operate).
 """
     a, axis = _chk_asarray(a, axis)
     mn = expand_dims(mean(a,axis),axis)
     deviations = a - mn
     n = a.shape[axis]
     vals = ss(deviations,axis)/(n-1.0)
 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
     if bias:
         return vals * (n-1.0)/n
 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
     else:
         return vals
 }}}
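
 The quoted functions depend on helpers internal to stats.py
 (_chk_asarray, ss). As a rough, self-contained sketch of the same
 arithmetic, the version below uses plain NumPy calls instead. The helper
 name sum_of_squares is a hypothetical stand-in for ss, and axis=None is
 not handled here, so this is an illustration of the two divisors rather
 than the stats.py code itself.

```python
import numpy as np


def sum_of_squares(x, axis=0):
    """Sum of squared values along `axis` (hypothetical stand-in for `ss`)."""
    return np.sum(np.asarray(x, dtype=float) ** 2, axis=axis)


def samplevar(a, axis=0):
    """Variance with divisor N, mirroring the quoted `samplevar`."""
    a = np.asarray(a, dtype=float)
    deviations = a - np.expand_dims(np.mean(a, axis=axis), axis)
    n = a.shape[axis]
    return sum_of_squares(deviations, axis) / float(n)


def var(a, axis=0, bias=False):
    """Variance with divisor N-1, or N when `bias` is true, mirroring the quoted `var`."""
    a = np.asarray(a, dtype=float)
    deviations = a - np.expand_dims(np.mean(a, axis=axis), axis)
    n = a.shape[axis]
    vals = sum_of_squares(deviations, axis) / (n - 1.0)
    # Rescaling by (N-1)/N turns the N-1 divisor back into an N divisor.
    return vals * (n - 1.0) / n if bias else vals


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(samplevar(data))       # divisor N
print(var(data))             # divisor N-1
print(var(data, bias=True))  # divisor N again, via rescaling
```

 With these definitions, var(..., bias=True) and samplevar(...) agree up
 to floating-point rounding, so the two functions differ only in the
 divisor each one is named after.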

Comment:

 This is debatable. For example, !MathWorld defines
 [http://mathworld.wolfram.com/SampleVariance.html sample variance] and
 population variance opposite to how you define them.
 [http://en.wikipedia.org/wiki/Variance Wikipedia] states that the term
 "sample variance" is used to name either convention. Consequently, I am
 closing this ticket as "wontfix".
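
 For reference, the two conventions under discussion can be written out
 as follows. The symbols are notation chosen here for illustration (they
 do not appear in the ticket): N is the number of values, and the
 subscript on s^2 names the divisor rather than "sample" or "population".

```latex
\[
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
s_N^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2, \qquad
s_{N-1}^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2
          = \frac{N}{N-1}\, s_N^2 .
\]
```

 In the quoted code, samplevar computes the N form and var computes the
 N-1 form by default.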

 The docstrings should use the terms consistently, of course. They should
 also explicitly state that they are using N or N-1 rather than relying on
 an ambiguous naming convention. Please submit new tickets if either of
 these conditions is violated. It would be good to have a paragraph or so
 in the {{{stats.py}}} docstring describing the convention. Please submit
 one if you want to write one up.
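
 As one possible shape for such a docstring, here is a minimal sketch of
 a function that states its divisor explicitly instead of relying on the
 words "sample" or "population". The function name variance and its ddof
 parameter are assumptions made for illustration, built on NumPy's
 np.var; this is not the stats.py API.

```python
import numpy as np


def variance(a, axis=0, ddof=1):
    """
    Return the variance of `a` along `axis`.

    The sum of squared deviations from the mean is divided by
    N - ddof, where N is the number of values along `axis`.
    ddof=1 (the default here) divides by N-1; ddof=0 divides by N.
    """
    return np.var(np.asarray(a, dtype=float), axis=axis, ddof=ddof)


values = [1.0, 2.0, 3.0, 4.0]
print(variance(values))          # divisor N-1
print(variance(values, ddof=0))  # divisor N
```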

 Besides, the correct form is actually N-2.  :-)

-- 
Ticket URL: <http://projects.scipy.org/scipy/scipy/ticket/300#comment:1>
SciPy <http://www.scipy.org/>
SciPy is open-source software for mathematics, science, and engineering.

