[SciPy-dev] PEP: Improving the basic statistical functions in Scipy
Bruce Southey
bsouthey@gmail....
Thu Feb 26 16:26:39 CST 2009
Hi,
I do apologize in advance if this is considered inappropriate but my
goal is to advance the stats capabilities in Scipy.
I do recognize that there is a large chunk of excellent work so part of
this is simply to ensure that there is adequate documentation and tests.
The following is my attempt at a PEP to provide some direction on how to
improve the basic statistical functions within Scipy. I do have a list
of the individual functions and the arguments involved, but I decided it
was inappropriate to attach it here.
Probably the main aspect that I would like feedback on is whether or not
there should be a single interface to these basic statistical functions.
Thanks
Bruce
PEP: Improving the basic statistical functions in Scipy
Authors: Bruce Southey
Created: 26-Feb-2009
Abstract
========
This PEP is oriented towards addressing the fundamental problems with
the basic statistical functions in Scipy. The intended outcome is to
provide Scipy with a consistent, well-tested and documented set of
basic statistical functions that are available to different array types.
Motivation
========
This PEP addresses the basic statistical functions available in the
stats component of Scipy. These functions are defined in the following
files:
stats.py – Defines many statistical functions and imports statlib.
morestats.py – Adds additional statistical functions to stats.py
_support.py – Defines the functions used in stats.py but is also
circular because it imports stats
mstats.py – Just imports functions from mstats_basic.py and mstats_extras.py
mstats_basic.py – Defines statistical functions for masked arrays
mstats_extras.py– Defines additional statistical functions for masked arrays
In total there are 178 unique functions defined in these files, some of
which are private or internal and some have the same name but are
defined slightly differently between standard and masked arrays. A list
of these functions is available. While the functions are defined for
standard arrays and masked arrays, not all functions are available for
both array types. For example, the majority of functions defined in
stats.py for standard arrays are available in mstats_basic.py. But none
of the standard arrays functions defined in morestats.py are available
for masked arrays. Also, none of these functions are directly
supported for other array types available to Scipy, such as record
arrays, record arrays that contain masked data and sparse arrays.
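One way to find which functions are missing for one array type is to compare the public names of two modules. The following is only a minimal sketch: the helper names public_functions and missing_from are hypothetical, and the two namespaces are small stand-ins rather than the real stats and mstats modules.

```python
import inspect
from types import SimpleNamespace


def public_functions(module):
    """Return the names of public callables found on a module-like object."""
    return {
        name
        for name, obj in inspect.getmembers(module, callable)
        if not name.startswith("_")
    }


def missing_from(reference, other):
    """Sorted names of public callables in `reference` that `other` lacks."""
    return sorted(public_functions(reference) - public_functions(other))


# Hypothetical stand-ins for a standard-array module and a masked-array module.
standard = SimpleNamespace(gmean=lambda a: a, hmean=lambda a: a, boxcox=lambda a: a)
masked = SimpleNamespace(gmean=lambda a: a, hmean=lambda a: a)

print(missing_from(standard, masked))
```

The same helper could presumably be pointed at the actual modules (for example scipy.stats and scipy.stats.mstats) to produce the kind of function list mentioned above.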
Specification
========
1) Provide the same basic statistical functions with the same arguments
for standard and masked arrays.
2) Utilize a single interface. For example, the gmean function is
currently defined twice (note that _chk_asarray is defined differently
in each file):
stats.py:
    def gmean(a, axis=0):
        a, axis = _chk_asarray(a, axis)
        log_a = np.log(a)
        return np.exp(log_a.mean(axis=axis))
mstats_basic.py:
    def gmean(a, axis=0):
        a, axis = _chk_asarray(a, axis)
        log_a = ma.log(a)
        return ma.exp(log_a.mean(axis=axis))
Instead, a single function can be defined as:
    def gmean(a, axis=0):
        log_a = np.log(a)
        return np.exp(log_a.mean(axis=axis))
The same expression gives the same result for a list, a standard array
and a masked array:
    import numpy as np
    import numpy.ma as ma
    X = [1, 2, 3, 4, 5]
    a = np.array(X)
    m = ma.array(X, mask=[0, 0, 0, 0, 0])
    np.exp(np.log(X).mean())  # 2.6051710846973517
    np.exp(np.log(a).mean())  # 2.6051710846973517
    np.exp(np.log(m).mean())  # 2.6051710846973517
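The check above uses a mask with no masked entries. As a further sketch, the same single function can be tried on a masked array where one value really is masked. The data below are made up for illustration, and behaviour for out-of-domain inputs (zeros or negative values) is not examined here.

```python
import numpy as np
import numpy.ma as ma


def gmean(a, axis=0):
    """Geometric mean along `axis`; masked entries are skipped if `a` is masked."""
    log_a = np.log(a)  # applying the ufunc to a masked array keeps the mask
    return np.exp(log_a.mean(axis=axis))


plain = np.array([1.0, 2.0, 4.0])
# The third entry (100.0) is masked, so only 1, 2 and 4 contribute.
masked = ma.array([1.0, 2.0, 100.0, 4.0], mask=[0, 0, 1, 0])

result_plain = gmean(plain)
result_masked = gmean(masked)
print(result_plain, result_masked)
```

Both calls should agree, since the masked value is excluded from the mean of the logarithms.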
3) Deprecation and removal of unnecessary functions such as linregress.
4) Clean up style issues including:
a) White space usage
b) Consistent arguments such as 'a' vs 'x' and the usage of *args
c) Uniquely identifying functions.
i) rootfunc and tempfunc are defined two and three times, respectively,
in morestats.py but have different arguments.
ii) makestr is defined twice in _support.py, once as a main function and
once as a subfunction of printcc.
5) Ensure info.py is complete and correct.
6) Improve the documentation of basic statistical functions in
connection with the Scipy documentation Marathon
(http://www.scipy.org/Developer_Zone/DocMarathon2008)
7) Improve the tests of the basic statistical functions:
i) All functions should have at least basic test coverage that
indicates whether or not the function works.
ii) Important functions should have tests that include unexpected
elements such as NaNs, positive and negative infinity and other
unexpected inputs.
iii) Ideally there should be tests that check the function accuracy.
8) Extension of the functions to other array types available to Scipy,
such as record arrays, record arrays that contain masked data and sparse
arrays. This is perhaps beyond the scope of this PEP.
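The three levels of testing in item 7 can be sketched for the gmean function as follows. This is only an illustration under the assumption that NaN and infinity should propagate to the result; the test names are hypothetical and gmean is redefined locally so the example is self-contained.

```python
import numpy as np


def gmean(a, axis=0):
    """Geometric mean along `axis` (local copy for the tests below)."""
    log_a = np.log(a)
    return np.exp(log_a.mean(axis=axis))


def test_gmean_basic():
    # i) Basic coverage: the function runs and returns a sensible value.
    assert np.isclose(gmean([1.0, 2.0, 4.0]), 2.0)


def test_gmean_nan_and_inf():
    # ii) Unexpected elements: NaN and positive infinity propagate.
    assert np.isnan(gmean([1.0, np.nan, 4.0]))
    assert np.isposinf(gmean([1.0, np.inf, 4.0]))


def test_gmean_accuracy():
    # iii) Accuracy: compare against a value known analytically.
    np.testing.assert_allclose(gmean(np.array([1.0, 10.0, 100.0])), 10.0, rtol=1e-12)


def test_gmean_axis():
    # The axis argument selects the direction of the reduction.
    data = np.array([[1.0, 4.0], [4.0, 16.0]])
    np.testing.assert_allclose(gmean(data, axis=0), [2.0, 8.0])
    np.testing.assert_allclose(gmean(data, axis=1), [2.0, 8.0])


for test in (test_gmean_basic, test_gmean_nan_and_inf,
             test_gmean_accuracy, test_gmean_axis):
    test()
```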
Backwards Compatibility
========
There is no guarantee that the outcome will maintain complete backwards
compatibility because a consistent API is required across different
array types. However, any change to an existing API must be justified,
for example by the need to use the same keywords in functions for
different array types.
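Where a keyword has to be renamed for consistency, one possible approach is to keep the old name for a while as a deprecated alias. The sketch below assumes a hypothetical old keyword 'x' being replaced by 'a'; it is not taken from the existing Scipy code.

```python
import warnings

import numpy as np


def gmean(a=None, axis=0, x=None):
    """Geometric mean; `x` is a hypothetical deprecated alias for `a`."""
    if x is not None:
        if a is not None:
            raise TypeError("pass either 'a' or the deprecated 'x', not both")
        warnings.warn("keyword 'x' is deprecated; use 'a' instead",
                      DeprecationWarning, stacklevel=2)
        a = x
    if a is None:
        raise TypeError("gmean() missing required argument 'a'")
    return np.exp(np.log(a).mean(axis=axis))


# Old-style call: still works but emits a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = gmean(x=[1.0, 2.0, 4.0])

# New-style call: no warning.
new_style = gmean([1.0, 2.0, 4.0])
print(old_style, new_style, [w.category.__name__ for w in caught])
```

This keeps existing callers working during a transition period while still moving towards the same keywords for all array types.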