[Numpy-discussion] var bias reason?

Travis E. Oliphant oliphant@enthought....
Wed Oct 15 09:45:39 CDT 2008

Gabriel Gellner wrote:
> Some colleagues noticed that var uses biased formula's by default in numpy,
> searching for the reason only brought up:
> http://article.gmane.org/gmane.comp.python.numeric.general/12438/match=var+bias
> which I totally agree with, but there was no response? Any reason for this?
I will try to respond to this as it was me who made the change.  I think 
there have been responses, but I think I've preferred to stay quiet 
rather than feed a flame war.   Ultimately, it is a matter of preference, 
and I don't think everybody would give equal weight to all the 
arguments surrounding the decision.

I will attempt to articulate my reasons:  dividing by n gives the maximum 
likelihood estimator of the variance (under a normal model), and I prefer 
that justification more 
than the "un-biased" justification for a default (especially given that 
bias is just one part of the "error" in an estimator).    Having every 
package that computes the mean return the "un-biased" estimate gives it 
more cultural weight than the concept deserves, I think.  Any 
surprise that is created by the different default should be mitigated by 
the fact that it's an opportunity to learn something about what you are 
doing.    Here is a paper I wrote on the subject that you might find 

(Hopefully, they will resolve a link problem at the above site soon, but 
you can read the abstract).
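
To illustrate the remark that bias is just one part of the "error" in an 
estimator, here is a small simulation sketch. It is not taken from the paper; 
the sample size, seed, and variable names are arbitrary choices made for 
illustration, and it assumes normally distributed data. For n = 5 and a true 
variance of 1, theory gives a mean squared error of 0.36 for the divide-by-n 
estimate and 0.5 for the divide-by-(n-1) estimate, so the biased estimate is 
the closer one on average under that model.

```python
import numpy as np

# Illustrative settings (assumptions, not from the original post).
rng = np.random.default_rng(0)
n, trials, true_var = 5, 200_000, 1.0

# Each row is one small sample drawn from a normal distribution.
samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))

var_n = samples.var(axis=1, ddof=0)   # divide by n (maximum likelihood)
var_n1 = samples.var(axis=1, ddof=1)  # divide by n - 1 ("un-biased")

# Bias: how far the average estimate is from the true variance.
bias_n = var_n.mean() - true_var
bias_n1 = var_n1.mean() - true_var

# Mean squared error: bias squared plus the estimator's own variance.
mse_n = np.mean((var_n - true_var) ** 2)
mse_n1 = np.mean((var_n1 - true_var) ** 2)

print("divide by n:     bias =", bias_n, " mse =", mse_n)
print("divide by n - 1: bias =", bias_n1, " mse =", mse_n1)
```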

I'm not trying to persuade anybody with this email (although if you can 
download the paper at the above link, then I am trying to persuade with 
that).  In this email I'm just trying to give context to the poster as I 
think the question is legitimate.

With that said, there is the ddof parameter so that you can change what 
the divisor is.  I think that is a useful compromise.
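
As a quick sketch of what ddof does (the sample array here is made up for 
illustration): the divisor is n - ddof, so the default ddof=0 divides by n 
and ddof=1 divides by n - 1.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # made-up data
n = x.size

# Sum of squared deviations from the sample mean.
ss = np.sum((x - x.mean()) ** 2)

var_default = np.var(x)       # ddof=0: divides by n
var_ddof1 = np.var(x, ddof=1)  # divides by n - 1

print(var_default, ss / n)
print(var_ddof1, ss / (n - 1))
```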

I'm unhappy with the internal inconsistency of cov, as I think it was an 
oversight. I'd be happy to see cov changed as well to use the ddof 
argument instead of the bias keyword, but that is an API change and 
requires some transition discussion and work.
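
The inconsistency can be seen in a short sketch like the following (made-up 
data again). By default cov divides by n - 1 while var divides by n. The last 
call assumes a NumPy version in which cov also accepts ddof, which was not 
available when this was written.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # made-up data

var_default = np.var(x)            # divides by n
cov_default = np.cov(x)            # divides by n - 1 (bias=False is the default)
cov_biased = np.cov(x, bias=True)  # divides by n, matching var's default

# Assumption: a NumPy release where cov has a ddof argument.
cov_ddof0 = np.cov(x, ddof=0)      # same divisor as var's default

print(var_default, float(cov_default), float(cov_biased), float(cov_ddof0))
```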

The only other argument I've heard against the current situation is 
"unit testing" against MATLAB or R code.   My suggestion is simply to use 
ddof=1 when comparing against MATLAB and R results.

Best regards,

