[SciPy-dev] PEP: Improving the basic statistical functions in Scipy

Bruce Southey bsouthey@gmail....
Thu Feb 26 16:26:39 CST 2009


Hi,
I do apologize in advance if this is considered inappropriate but my 
goal is to advance the stats capabilities in Scipy.

I do recognize that there is a large chunk of excellent work so part of 
this is simply to ensure that there is adequate documentation and tests.

The following is my attempt at a PEP to provide some direction on how to 
improve the basic statistical functions within Scipy. I do have a list 
of the individual functions and the arguments involved, but I decided it 
was inappropriate to attach it here.

The main aspect on which I would like feedback is whether or not 
there should be a single interface to these basic statistical functions.

Thanks
Bruce

PEP: Improving the basic statistical functions in Scipy
Authors: Bruce Southey
Created: 26-Feb-2009

Abstract
========
This PEP is aimed at addressing the fundamental problems with the basic 
statistical functions in Scipy. The intended outcome is to provide Scipy 
with a consistent, well-tested and well-documented set of basic 
statistical functions that are available for different array types.

Motivation
========
This PEP addresses the basic statistical functions available in the 
stats component of Scipy. These functions are defined in the following 
files:
stats.py – Defines many statistical functions and imports statlib.
morestats.py – Adds additional statistical functions to stats.py
_support.py – Defines the functions used in stats.py but also creates a 
circular import because it imports stats
mstats.py – Just imports functions from mstats_basic.py and mstats_extras.py
mstats_basic.py – Defines statistical functions for masked arrays
mstats_extras.py – Defines additional statistical functions for masked arrays

In total there are 178 unique functions defined in these files. Some of 
them are private or internal, and some have the same name but are 
defined slightly differently for standard and masked arrays. A list 
of these functions is available. While the functions are defined for 
standard arrays and masked arrays, not all functions are available for 
both array types. For example, the majority of functions defined in 
stats.py for standard arrays are available in mstats_basic.py, but none 
of the standard array functions defined in morestats.py are available 
for masked arrays. Also, none of these functions are directly 
supported for other array types available to Scipy, such as record 
arrays, record arrays that contain masked data and sparse arrays.

Specification
========

1) Provide the same basic statistical functions with the same arguments 
for standard and masked arrays.
2) Utilize a single interface. For example, the gmean function is 
currently defined twice (note that _chk_asarray is defined differently 
in each file):
stats.py:
def gmean(a, axis=0):
    a, axis = _chk_asarray(a, axis)
    log_a = np.log(a)
    return np.exp(log_a.mean(axis=axis))

mstats_basic.py:
def gmean(a, axis=0):
    a, axis = _chk_asarray(a, axis)
    log_a = ma.log(a)
    return ma.exp(log_a.mean(axis=axis))

Rather, a single function can be defined as:
def gmean(a, axis=0):
    log_a = np.log(a)
    return np.exp(log_a.mean(axis=axis))

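This works because the same expression gives the same result for a 
list, a standard array and a masked array:
import numpy as np
import numpy.ma as ma
X = [1, 2, 3, 4, 5]
a = np.array(X)
m = ma.array(X, mask=[0, 0, 0, 0, 0])
np.exp(np.log(X).mean())  # 2.6051710846973517
np.exp(np.log(a).mean())  # 2.6051710846973517
np.exp(np.log(m).mean())  # 2.6051710846973517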

3) Deprecation and removal of unnecessary functions such as linregress.

4) Clean up style issues including:
a) White space usage
b) Consistent arguments such as 'a' vs 'x' and the usage of *args
c) Uniquely identifying functions.
i) rootfunc and tempfunc are defined two and three times, respectively, 
in morestats.py but with different arguments.
ii) makestr is defined twice in _support.py, once as a main function and 
once as a subfunction of printcc.

5) Ensure info.py is complete and correct.

6) Improve the documentation of basic statistical functions in 
connection with the Scipy documentation Marathon 
(http://www.scipy.org/Developer_Zone/DocMarathon2008)

7) Improve the tests of the basic statistical functions:
i) All functions should have at least basic test coverage that 
indicates whether or not the function works.
ii) Important functions should have tests that include unexpected 
elements like NaNs, positive and negative infinity and other unexpected 
inputs.
iii) Ideally there should be tests that check the function accuracy.

8) Extension of the functions to other array types available to Scipy, 
such as record arrays, record arrays that contain masked data and sparse 
arrays. This is perhaps beyond the scope of this PEP.
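
As an illustration of the three levels of testing in item 7, here is a 
minimal sketch that uses the single-interface gmean from item 2. The 
test names are hypothetical, and the NaN and infinity checks only record 
what NumPy currently returns; they are an assumption, not a decision 
about what the functions should do with such inputs.

```python
import numpy as np
import numpy.ma as ma


def gmean(a, axis=0):
    # Single-interface geometric mean from item 2 (no _chk_asarray step).
    log_a = np.log(a)
    return np.exp(log_a.mean(axis=axis))


def test_gmean_basic():
    # i) Basic coverage: the function runs and returns the expected value.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    assert np.allclose(gmean(x), 120.0 ** 0.2)


def test_gmean_masked_matches_standard():
    # Same answer for a standard array and a masked array with no
    # masked elements.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    m = ma.array(x, mask=[0, 0, 0, 0, 0])
    assert np.allclose(gmean(m), gmean(x))


def test_gmean_nan_and_inf():
    # ii) Unexpected elements: NaN propagates and +inf gives +inf.
    assert np.isnan(gmean(np.array([1.0, np.nan, 3.0])))
    assert np.isinf(gmean(np.array([1.0, np.inf, 3.0])))


def test_gmean_accuracy():
    # iii) Accuracy: the geometric mean of 2 and 8 is 4.
    assert np.allclose(gmean(np.array([2.0, 8.0])), 4.0)
```

Writing the tests against one function and feeding it several array 
types is one way to check that the interface really is shared.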

Backwards Compatibility
========

There is no guarantee that the outcome will maintain complete backwards 
compatibility because a consistent API is required across different 
array types. However, any change to an existing API must be justified, 
for example by the need to ensure the same keywords between functions 
for different array types.
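
One possible way to soften a keyword change such as 'x' to 'a' is to 
accept the old keyword for a while and emit a DeprecationWarning. The 
sketch below uses only the standard library; rename_keyword and mean_of 
are hypothetical names for illustration and do not exist in Scipy.

```python
import functools
import warnings


def rename_keyword(old, new):
    """Accept the keyword `old` as a deprecated alias for `new`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old in kwargs:
                if new in kwargs:
                    raise TypeError(
                        "%s() got both '%s' and '%s'"
                        % (func.__name__, old, new))
                warnings.warn(
                    "keyword '%s' is deprecated; use '%s' instead"
                    % (old, new),
                    DeprecationWarning,
                    stacklevel=2)
                # Forward the value under the new keyword name.
                kwargs[new] = kwargs.pop(old)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@rename_keyword("x", "a")
def mean_of(a):
    # Hypothetical function whose argument used to be called 'x'.
    return sum(a) / float(len(a))
```

Calling mean_of(x=[1, 2, 3]) still works but warns, while 
mean_of(a=[1, 2, 3]) is silent, so existing code keeps running during a 
transition period.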

