[Numpy-discussion] var bias reason?
Travis E. Oliphant
Wed Oct 15 09:45:39 CDT 2008
Gabriel Gellner wrote:
> Some colleagues noticed that var uses biased formulas by default in numpy,
> searching for the reason only brought up:
> which I totally agree with, but there was no response? Any reason for this?
I will try to respond to this as it was me who made the change. I think
there have been responses, but I think I've preferred to stay quiet
rather than feed a flame war. Ultimately, it is a matter of preference,
and I don't think everybody would give equal weight to all the
arguments surrounding the decision.
I will attempt to articulate my reasons: dividing by n gives the maximum
likelihood estimator of variance (for normally distributed data), and I prefer that justification more
than the "un-biased" justification for a default (especially given that
bias is just one part of the "error" in an estimator). Having every
package that computes the variance return the "un-biased" estimate gives it
more cultural weight than the concept deserves, I think. Any
surprise that is created by the different default should be mitigated by
the fact that it's an opportunity to learn something about what you are
doing. Here is a paper I wrote on the subject that you might find
(Hopefully, they will resolve a link problem at the above site soon, but
you can read the abstract).
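
The claim that bias is only one part of an estimator's error can be checked with a small simulation. The sketch below is not taken from the paper; it assumes normally distributed data with a known true variance of 1, and the sample size, trial count, and variable names are all illustrative choices. It compares the divide-by-n and divide-by-(n-1) estimators on both bias and mean squared error (MSE).

```python
import numpy as np

# Illustrative settings (assumptions, not from the original post).
rng = np.random.default_rng(0)
n, trials, true_var = 5, 200_000, 1.0

# Each row is one sample of size n drawn from N(0, true_var).
samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))

var_n = samples.var(axis=1, ddof=0)    # divide by n (NumPy default)
var_n1 = samples.var(axis=1, ddof=1)   # divide by n - 1 ("unbiased")

# Bias: average estimate minus the true value.
bias_n = var_n.mean() - true_var
bias_n1 = var_n1.mean() - true_var

# MSE: average squared distance from the true value (bias^2 + variance).
mse_n = np.mean((var_n - true_var) ** 2)
mse_n1 = np.mean((var_n1 - true_var) ** 2)

print(f"divide by n:     bias={bias_n:+.3f}  mse={mse_n:.3f}")
print(f"divide by n - 1: bias={bias_n1:+.3f}  mse={mse_n1:.3f}")
```

Under these Gaussian assumptions the theoretical MSE is (2n-1)/n^2 = 0.36 for the divide-by-n estimator and 2/(n-1) = 0.5 for the divide-by-(n-1) estimator, so the biased estimator is closer to the true variance on average even though its expected value is too small.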
I'm not trying to persuade anybody with this email (although if you can
download the paper at the above link, then I am trying to persuade with
that). In this email I'm just trying to give context to the poster as I
think the question is legitimate.
With that said, there is the ddof parameter so that you can change what
the divisor is. I think that is a useful compromise.
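
For reference, a minimal sketch of how ddof changes the divisor: the sum of squared deviations is divided by N - ddof, where N is the number of elements. The data values and variable names here are made up for illustration.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

# Sum of squared deviations from the sample mean.
ss = np.sum((x - x.mean()) ** 2)

var_default = np.var(x)           # ddof=0: divides by n
var_ddof1 = np.var(x, ddof=1)     # ddof=1: divides by n - 1

# The same numbers computed by hand from the definition.
assert np.isclose(var_default, ss / n)
assert np.isclose(var_ddof1, ss / (n - 1))

# np.std takes the same ddof argument.
std_ddof1 = np.std(x, ddof=1)
assert np.isclose(std_ddof1, np.sqrt(var_ddof1))
```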
I'm unhappy with the internal inconsistency of cov, as I think it was an
oversight. I'd be happy to see cov changed as well to use the ddof
argument instead of the bias keyword, but that is an API change and
requires some transition discussion and work.
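
The inconsistency is easy to see on a single 1-D array: with default arguments, var divides by n while cov divides by n - 1, and the two agree only when cov is given bias=True. This is a small sketch with made-up data.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

var_default = np.var(x)                     # divides by n
cov_default = float(np.cov(x))              # divides by n - 1 by default
cov_biased = float(np.cov(x, bias=True))    # divides by n

# The defaults disagree with each other...
assert not np.isclose(var_default, cov_default)
# ...while var's default matches cov with bias=True,
assert np.isclose(var_default, cov_biased)
# and cov's default matches var with ddof=1.
assert np.isclose(cov_default, np.var(x, ddof=1))
```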
The only other argument I've heard against the current situation is
"unit testing" with MATLAB or R code. My suggestion is simply to use ddof=1
when comparing against MATLAB and R code.
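
A unit test along those lines might look like the sketch below. The reference value is computed by hand with the n - 1 divisor that the default var functions in R and MATLAB use; it is not pasted from an actual R or MATLAB session, and the function name check_against_reference is hypothetical.

```python
import numpy as np

def check_against_reference(data, reference_var):
    """Compare NumPy's variance with a reference computed with an n - 1 divisor."""
    # ddof=1 makes NumPy divide by n - 1, matching the reference convention.
    np.testing.assert_allclose(np.var(data, ddof=1), reference_var, rtol=1e-12)

x = np.array([1.0, 2.0, 3.0, 4.0])

# Hand-computed: mean is 2.5, squared deviations sum to 5, and 5 / 3 is the
# sample variance with the n - 1 divisor.
reference = 5.0 / 3.0

check_against_reference(x, reference)

# Without ddof=1 the same comparison would fail, since 5 / 4 != 5 / 3.
default_matches = bool(np.isclose(np.var(x), reference))
```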