[SciPy-dev] Homogenizing stats & mstats
Fri Jul 24 10:14:18 CDT 2009
On 07/24/2009 01:15 AM, Pierre GM wrote:
> I was browsing some recent tickets for scipy.stats, and couldn't help but
> notice that a significant number of them (#845, #822, #901...) are
> related to some lack of consistency between stats and mstats.
> I'd like to eventually get rid of mstats altogether, provided the
> same functionalities are supported in stats.
Yeah, that would be great but I ran out of steam to do more and have not
found the time to go back.
> * A first step would be to use np.asanyarray instead of np.asarray.
> That should be sufficient for functions like gmean and hmean for
Well there should be a couple of patches for those two.
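To illustrate the np.asanyarray point, here is a minimal sketch. The name gmean_sketch is made up for this example and is not the scipy.stats implementation; it only shows that np.asanyarray keeps the MaskedArray subclass (and so its mask) while np.asarray drops it:

```python
import numpy as np

def gmean_sketch(a, axis=0):
    """Geometric mean that keeps ndarray subclasses such as MaskedArray.

    Hypothetical helper for illustration only.
    """
    a = np.asanyarray(a)  # unlike np.asarray, does not strip the subclass
    return np.exp(np.log(a).mean(axis=axis))

# The last value is masked, so it should not contribute to the result.
x = np.ma.masked_array([1.0, 4.0, 16.0, 1000.0], mask=[0, 0, 0, 1])

print(type(np.asarray(x)).__name__)     # plain ndarray, the mask is lost
print(type(np.asanyarray(x)).__name__)  # MaskedArray, the mask survives
print(gmean_sketch(x))                  # geometric mean of 1, 4 and 16 only
```

The same function still accepts plain lists and ndarrays, which is why this looks like a cheap first step.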
It was not clear whether some functions should be in scipy, or even in
stats, at least in their current form (this made me stop what I was
doing). I really hope that Numpy will eventually provide something like
nanmean and nanstd. Some functions appeared to be limited to specific
array dimensions (trimboth), others appear to be one-liners, and some
have different names but may be the same function (trim_mean and
As I now think about these functions, the stats functions do need to be
split into at least two parts: descriptive stats such as the geometric
mean (gmean), and statistical test functions such as kendalltau. Perhaps
we could even add a set of utility functions like tmax, tmean and tmin
(but these are limited to one-dimensional arrays).
We also need to address ticket 604: 'Statistics functions with new
options' at the same time.
> * A second step would be to use numpy.ma under the hood, returning
> either a MaskedArray if the input is a MaskedArray itself, or just a
> standard ndarray otherwise. That should take care of the functions
> related to ranking and tie handling (I'm pretty confident in the
> mstats routines, and we can always double-check the results w/ R). If
> needed, we could also add a usemask flag, like we do in
Really I think that the input object must be preserved unless the user
states otherwise. One aspect is that masked arrays automatically mask
any nonfinite elements such as infinity or NaN. For certain stats it is
essential to know that this has occurred, as it signals a larger
problem, but automatic masking hides that problem. For example:
import numpy as np
c = np.ma.masked_array([1., 2., 3., np.nan], [1, 0, 0, 0])  # masked array holding an unmasked NaN
c / 2  # automatically masks the np.nan, which is fine if you expect it but
       # not if you do not want nonfinite values masked
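One way to keep that information is to look for unmasked nonfinite values before doing any masked arithmetic. This is only a sketch: unmasked_nonfinite is a hypothetical helper name, not an existing numpy.ma or scipy function.

```python
import numpy as np

def unmasked_nonfinite(a):
    """Return a boolean array marking unmasked entries that are NaN or +/-inf.

    Hypothetical helper for illustration only.
    """
    data = np.ma.getdata(a)        # underlying values, ignoring the mask
    mask = np.ma.getmaskarray(a)   # full boolean mask (all False for plain arrays)
    return ~np.isfinite(data) & ~mask

c = np.ma.masked_array([1., 2., 3., np.nan], mask=[1, 0, 0, 0])

# Check while the NaN is still visible, before any operation can mask it.
flags = unmasked_nonfinite(c)
if flags.any():
    print("unmasked nonfinite values at positions", np.flatnonzero(flags))

half = c / 2
print(half.mask)  # compare with c.mask to see what the division masked
```

A stats function could run such a check on its input and warn, rather than silently returning a result computed on fewer points.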
It would be great to have at least the Matrix class work
(record/structured arrays and even sparse arrays as well), but I do not
know enough about these to know how.
> * A third would be to port the remaining routines of mstats.extras to
> stats or morestats (Harrell-Davies quantiles could be implemented more
> efficiently in cython, for example).
> At each step, we could add a Deprecate warning to a reviewed mstat
> function and call the corresponding stat function instead.
Unfortunately there is not a one-to-one matching between the stats and
mstats functions. When I started I found 178 functions across the
different modules, including some that are or should be deprecated. Only
about 40 functions (plus a few that should be removed) have the
same name in the stats and masked_basic files. I have not checked
whether these have exactly the same behavior, as expected for the input
type. There are others that perhaps differ only in name.
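For the functions that do match, the deprecation step suggested above could look roughly like this. Both function names are hypothetical stand-ins, not the real scipy.stats or scipy.stats.mstats code; the sketch only shows the pattern of warning and then delegating.

```python
import warnings
import numpy as np

def stats_gmean(a, axis=0):
    """Hypothetical stand-in for the reviewed stats function."""
    a = np.asanyarray(a)  # keeps masked arrays masked
    return np.exp(np.log(a).mean(axis=axis))

def mstats_gmean(a, axis=0):
    """Hypothetical stand-in for the old mstats function.

    Emits a DeprecationWarning and forwards to the stats version.
    """
    warnings.warn(
        "the mstats version of gmean is deprecated; use the stats version",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this wrapper
    )
    return stats_gmean(a, axis=axis)
```

This only works once the stats version has been checked to give the same answer for masked input, which is the part that has not been done yet.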
> What would be a good time line ? 0.8.0, or is it too late? 0.9.0 ?
For 0.8 I think we must at least warn users that changes are coming for
stats and mstats, as well as make sure that any unnecessary functions are
deprecated. Also we could start the process to reorganize the stats
functions and combine the stats and mstats functions with the same name
> Comments expected.
> Thx in advance