[SciPy-dev] Statistics Review progress

Ed Schofield schofield at ftw.at
Wed Apr 12 08:03:23 CDT 2006

Robert Kern wrote:
> This weekend I made a first pass through stats.py and did a fair bit of cleanup.
> I checked in my changes as separate revisions for easier perusal:
> http://projects.scipy.org/scipy/scipy/timeline?from=04%2F09%2F06&daysback=0&changeset=on&update=Update
> I got through about 25 or so functions. For the most part, I focused on the
> docstrings and making the code look nice and use relatively modern numpy idioms.
> Implementing proper unit tests would be the next step for this set of functions.

Well done for your work on this!

> * anova(): I want to get rid of it. It and its support functions take up nearly
> a quarter of the entire code of stats.py. It returns nothing but simply prints
> it out to stdout. It uses globals. It depends on several undocumented,
> uncommented functions that nothing else uses; getting rid of anova() closes a
> lot of other tickets, too (I subscribe to the "making progress by moving
> goalposts" model of development). It's impossible to unit-test since it returns
> no values. Gary Strangman, the original author of most of stats.py, removed it
> from his copy some time ago because he didn't have confidence in its implementation.
> It appears to me that the function would need a complete rewrite to meet the
> standards I have set for passing review. I propose to remove it now.
+1.  This sounds like the best option for the short term.

> * Some of the functions like mean() and std() are replications of functionality
> in numpy and even the methods of array objects themselves. I would like to
> remove them, but I imagine they are being used in various places. There's a
> certain amount of code breakage I'm willing to accept in order to clean up
> stats.py (e.g. all of my other bullet items), but this seems just gratuitous.
I think we should remove the duplicated functions mean, std, and var
from stats.  The corresponding functions are currently imported from
numpy into the stats namespace anyway.

> * We really need to sort out the issue of biased and unbiased estimators. At
> least, a number of scipy.stats functions compute values that could be computed
> in two different ways, conventionally given labels "biased" and "unbiased". Now
> while there is some disagreement as to which is better (you get to guess which I
> prefer), I think we should offer both.
> Normally, I try to follow the design principle that if the value of a keyword
> argument is almost always given as a constant (e.g. bias=True rather than
> bias=flag_set_somewhere_else_in_my_code), then the functionality should be
> exposed as two separate functions. However, there are a lot of these functions
> in scipy.stats, and I don't think we would be doing anyone any favors by
> doubling the number of these functions. IMO, "practicality beats purity" in this
> case.
I'd argue strongly that var and std should be identical to the functions
in numpy.  If we want the biased versions as well, we'd need separate
functions like varbiased.

I don't really see the benefit of a 'bias' flag.  If we do encounter
some real problems in handling the biased estimators consistently
without it, we might as well argue for modifying the corresponding
functions in numpy.  But it'd be trivial to write

def my_var_function_with_bias_flag(a, bias=True):
    if bias:
        return varbiased(a)
    else:
        return var(a)

if this were ever necessary.
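
To make the distinction concrete, here is a rough sketch of the two
estimators in plain Python.  The names varbiased and varunbiased are
only illustrative (neither is an existing numpy or scipy function), and
which of the two numpy's var should match is exactly what is under
discussion here.  The only difference is the divisor: n versus n - 1.

```python
def varbiased(a):
    """Biased variance estimate: sum of squared deviations divided by n."""
    n = len(a)
    mean = sum(a) / n
    return sum((x - mean) ** 2 for x in a) / n


def varunbiased(a):
    """Unbiased variance estimate: same sum, divided by n - 1."""
    n = len(a)
    mean = sum(a) / n
    return sum((x - mean) ** 2 for x in a) / (n - 1)


# Mean is 5.0 and the squared deviations sum to 32.0.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
biased = varbiased(data)      # 32 / 8
unbiased = varunbiased(data)  # 32 / 7
```

The two results differ by a factor of n / (n - 1), so the gap shrinks as
the sample grows.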

> The names "biased" and "unbiased" are, of course up for discussion, since the
> label "biased" is not particularly precise. The default setting is also up for
> discussion.

I can't think of anything better than 'biased'.  'Sample' would be
ambiguous, as Zachary mentioned.  Using 'unbiased' in the names would be
incorrect for std, as Zachary also mentioned, but we could avoid this by
using numpy.std etc. instead.  We could also change the numpy.std
docstring to note explicitly that it's the square root of the unbiased
sample variance estimate.
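
The point about std can be checked exactly on a toy case.  The sketch
below is only an illustration (the helper var_unbiased and the fair-coin
population are my own choices, not anything in stats.py): it enumerates
every equally likely sample of size 2 and averages the estimates.  The
n - 1 variance averages to the true variance, but its square root
averages to less than the true standard deviation.

```python
from itertools import product
from math import sqrt


def var_unbiased(sample):
    """Variance estimate with the n - 1 divisor."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - 1)


# Fair coin: true variance is 0.25, true standard deviation is 0.5.
population = [0.0, 1.0]

# All samples of size 2 drawn with replacement; each is equally likely.
samples = list(product(population, repeat=2))

# Expected value of each estimator, computed by exact enumeration.
mean_var = sum(var_unbiased(s) for s in samples) / len(samples)
mean_std = sum(sqrt(var_unbiased(s)) for s in samples) / len(samples)

print(mean_var)  # matches the true variance
print(mean_std)  # falls below the true standard deviation
```

So calling the square-root version "unbiased" would indeed be a misnomer.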

-- Ed
