[SciPy-dev] Homogenizing stats & mstats

Pierre GM pgmdevlist@gmail....
Fri Jul 24 13:23:51 CDT 2009

On Jul 24, 2009, at 11:14 AM, Bruce Southey wrote:
> As I now think about these functions, the stats functions do need to
> be split into at least two parts, such as descriptive stats like
> geometric mean (gmean) and statistical test functions like
> kendalltau.  Perhaps even adding a set of utility functions like
> tmax, tmean and tmin (but these are limited to one-dimensional
> arrays).

That was my intention as well: to split stats into 2 or 3 files.
Descriptive stats (means & quantiles) on one side, tests on the other
sounds good. Should we start creating these files already, side by side
with the current stats/mstats files? Should we create a branch?

> We also need to address ticket 604:'Statistics functions with new  
> options' at the same time.
> http://projects.scipy.org/scipy/ticket/604


>> * A second step would be to use numpy.ma under the hood, returning
>> either a MaskedArray if the input is a MaskedArray itself, or just a
>> standard ndarray otherwise.
> Really I think that the input object must be preserved unless the
> user states otherwise. One aspect is that masked arrays
> automatically mask any nonfinite elements like infinity. For
> certain stats it is essential to know that this has occurred, as it
> signals a larger problem, but automatically masking hides this
> problem. For example:
> c = np.ma.masked_array([1., 2., 3., np.nan], [1, 0, 0, 0])
> # provides a masked array with NaN
> c / 2
> # automatically masks the np.nan, which is fine if you know but
> # not if you do not want nonfinite values masked.
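
A runnable version of the example above, as a minimal sketch. It relies only on NumPy's masked-array division and records the mask before and after dividing, so the silently masked NaN becomes visible. The variable names are illustrative.

```python
import numpy as np

# The first element is masked explicitly; the NaN in the last slot is NOT.
c = np.ma.masked_array([1., 2., 3., np.nan], mask=[1, 0, 0, 0])
mask_before = np.ma.getmaskarray(c)

# Division in numpy.ma is domain-checked: nonfinite results are masked too.
half = c / 2
mask_after = np.ma.getmaskarray(half)

print(mask_before)  # the NaN position is still unmasked here
print(mask_after)   # the NaN position is now masked as well
```

Comparing the two masks is one way for a caller to notice that nonfinite values were hidden along the way.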

OK, I see the problem here. We could have this usemask flag tell us
whether to use the MA behaviour (invalid outputs are masked, a MA is
output no matter the type of the input) or not (NaN/Infs are preserved,
a standard ndarray is output no matter the type of the input).
Nevertheless, some of the functions (ranking, tests with ties) work
correctly in mstats and not in stats (compared to R): we could use the
mstats implementation instead of the stats one, then.
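
To make the usemask idea concrete, here is a rough sketch. The function name center_sketch is hypothetical and is not an existing stats or mstats routine. Turning already-masked entries into NaN when usemask is off is my own assumption, since the thread does not settle that case.

```python
import numpy as np
import numpy.ma as ma

def center_sketch(a, usemask=True):
    """Subtract the mean, following the proposed (hypothetical) usemask switch."""
    if usemask:
        # MA behaviour: nonfinite values are masked, output is a MaskedArray.
        m = ma.masked_invalid(ma.asarray(a, dtype=float))
        return m - m.mean()
    # Plain behaviour: output is an ndarray and NaN/Inf propagate.
    # Assumption: entries that were already masked are turned into NaN.
    d = ma.asarray(a, dtype=float).filled(np.nan)
    return d - d.mean()

x = np.array([1., 2., 3., np.nan])
print(center_sketch(x, usemask=True))   # MaskedArray, NaN slot masked
print(center_sketch(x, usemask=False))  # ndarray, NaN propagates
```

With the flag off, a single NaN contaminates the mean and hence every output value, which is exactly the signal Bruce wants to keep visible.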

> It would be great to have at least the Matrix class work
> (record/structured arrays and even sparse arrays as well) but I do
> not know enough about these to know how.

Not too much of a problem for descriptive stats on Matrix if we use
np.asanyarray. Structured arrays are a different beast, as the
standard functions (+-/*...) don't work (for a good reason, and this
may change later on). I've no experience with sparse arrays, so count
me out on this one.
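
A small sketch of the np.asanyarray point, using a made-up column_means helper: asanyarray passes ndarray subclasses such as np.matrix through untouched, so the result keeps the input class, while a plain ndarray input gives a plain ndarray back.

```python
import numpy as np

def column_means(a):
    # asanyarray keeps ndarray subclasses (np.matrix, MaskedArray) intact,
    # whereas asarray would convert them to a base ndarray.
    a = np.asanyarray(a)
    return a.mean(axis=0)

m = np.matrix([[1., 2.], [3., 4.]])
res_matrix = column_means(m)              # stays a 2-D np.matrix
res_array = column_means(np.asarray(m))   # plain 1-D ndarray

print(type(res_matrix), res_matrix.shape)
print(type(res_array), res_array.shape)
```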

>> * A third would be to port the remaining routines of mstats.extras to
>> stats or morestats (Harrell-Davies quantiles could be implemented more
>> efficiently in Cython, for example).
>> At each step, we could add a deprecation warning to a reviewed mstats
>> function and call the corresponding stats function instead.
> Unfortunately there is not a one to one matching between the stats  
> and mstats functions.

Mmh, if we proceed methodically, that shouldn't be too much of a
problem. Name differences can be easily addressed. Behavior differences
are trickier, but may be just bugs waiting for us.
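
The deprecate-and-forward step could look roughly like the sketch below. It uses only the standard library; deprecated_alias, stats_gmean and mstats_gmean are hypothetical stand-ins, not the actual scipy functions.

```python
import functools
import warnings

def deprecated_alias(new_func, old_name):
    """Return a wrapper that warns, then forwards to new_func."""
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)
    return wrapper

# Hypothetical stand-in for a reviewed stats function.
def stats_gmean(values):
    product = 1.0
    for v in values:
        product *= v
    return product ** (1.0 / len(values))

# The old mstats name now only warns and delegates.
mstats_gmean = deprecated_alias(stats_gmean, "mstats_gmean")
```

This also handles plain name differences: the old name lives on as a thin alias for a release or two while callers migrate.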

> When I started I found 178 functions between the different modules,
> including some that are or should be deprecated. Only about 40
> functions (plus a few that should be removed) have the same
> name in the stats and masked_basic files. I have not checked these
> to know if they have the exact same behavior as expected by the
> input type. There are others that perhaps only differ in name.

>> What would be a good timeline? 0.8.0, or is it too late? 0.9.0?
> For 0.8 I think we must at least warn users that changes are coming
> for stats and mstats, as well as make sure that any unnecessary
> functions are deprecated. Also we could start the process to
> reorganize the stats functions and combine the stats and mstats
> functions with the same name and behavior.

When is 0.8.0 supposed to be released? If it's a matter of just a
couple of weeks, we can sit on the issue as long as needed. If it's
longer than that, we should probably get started now.
