[SciPy-dev] Homogenizing stats & mstats
Fri Jul 24 15:12:49 CDT 2009
On 07/24/2009 01:23 PM, Pierre GM wrote:
> On Jul 24, 2009, at 11:14 AM, Bruce Southey wrote:
>> As I now think about these functions, the stats functions do need to
>> be split into at least two parts, such as descriptive stats like
>> geometric mean (gmean) and statistical test functions like
>> kendalltau. Perhaps even adding a set of utility functions like
>> tmax, tmean and tmin (but these are limited to one dimensional
> That was my intention as well to split stats into 2 or 3 files.
> Descriptive stats (means & quantiles) on one side, tests on the other
> sounds good. Should we start creating these files already, side by side
> with the current stats/mstats files? Should we create a branch?
Just go ahead and do what you want! :-)
The real issue is whether or not the stats files will be replaced by the
new versions or be a new entity (that could then replace the old
versions). Initially it would be good to keep the old versions around to
check and test functionality.
>> We also need to address ticket 604:'Statistics functions with new
>> options' at the same time.
>>> * A second step would be to use numpy.ma under the hood, returning
>>> either a MaskedArray if the input is a MaskedArray itself, or just a
>>> standard ndarray otherwise.
>> Really I think that the input object must be preserved unless the
>> user states otherwise. One aspect is that masked arrays
>> automatically mask any nonfinite elements like infinity. For
>> certain stats it is essential to know that this has occurred, as it
>> signals a larger problem, but automatic masking hides this
>> problem. For example:
>> c=np.ma.masked_array([1.,2.,3., np.nan], [1,0,0,0]) # provides a
>> masked array with NaN
>> c/2 # automatically masks the np.nan, which is fine if you know but
>> not if you do not want nonfinite values masked.
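A runnable version of the example above may make the behaviour easier to see. This is only a small sketch: the names `half` and `raw_half` are mine, and the second half (dividing the underlying `.data`) is just one assumed way of keeping the NaN visible.

```python
import numpy as np

# Masked array holding a NaN; only the first element is masked on input.
c = np.ma.masked_array([1., 2., 3., np.nan], mask=[1, 0, 0, 0])

# Masked division: the nonfinite result is masked as well.
half = c / 2
print(half)
print(np.ma.getmaskarray(half))

# Dividing the underlying data instead leaves the NaN in plain sight.
raw_half = c.data / 2
print(raw_half)
```

Comparing the two printed results shows what gets hidden by the mask.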
> OK, I see the problem here. We could have this usemask flag tell us
> whether to use the MA behaviour (invalid outputs are masked, a MA is
> output no matter the type of the input) or not (NaN/Infs are preserved,
> a standard ndarray is output no matter the type of the input).
> Nevertheless, some of the functions (ranking, tests with ties) work
> correctly in mstats and not in stats (compared to R): we could use the
> mstats implementation instead of the stats one, then.
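The usemask idea could look roughly like the sketch below. Everything here is hypothetical: `column_mean` is an illustrative function, not an existing stats or mstats routine, and filling masked entries with NaN in the non-MA branch is my own assumption about what "preserved" should mean for masked input.

```python
import numpy as np

def column_mean(a, usemask=True):
    """Mean over axis 0 with a hypothetical `usemask` switch."""
    if usemask:
        # MA behaviour: invalid entries are masked, output is a MaskedArray.
        return np.ma.masked_invalid(a).mean(axis=0)
    # Plain behaviour: NaN/Inf propagate, output is a standard ndarray.
    # Assumption: masked entries of a masked input are turned into NaN.
    data = np.ma.asarray(a, dtype=float).filled(np.nan)
    return data.mean(axis=0)

x = np.array([[1.0, 2.0], [3.0, np.nan]])
masked_result = column_mean(x, usemask=True)
plain_result = column_mean(x, usemask=False)
print(type(masked_result).__name__, masked_result)
print(type(plain_result).__name__, plain_result)
```

With usemask=True the NaN is silently skipped; with usemask=False it propagates into the second column's result.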
>> It would be great to have at least the Matrix class work (record/
>> structured arrays and even sparse arrays as well) but I do not know
>> enough about these to know how.
> Not too much of a problem for descriptive stats on Matrix if we use
> np.asanyarray. Structured arrays are a different beast, as the
> standard functions (+-/*...) don't work (for a good reason, and this
> may change later on). I've no experience with sparse arrays, so count me
> out on this one.
Sounds like Matrix should be sufficiently easy to incorporate and we
leave the rest on the wish list.
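As a rough illustration of the np.asanyarray point, here is a simplified geometric mean. It is a sketch only, not the scipy.stats implementation, and it skips all input validation.

```python
import numpy as np

def gmean(a, axis=0):
    """Simplified geometric mean (illustration only)."""
    # asanyarray passes ndarray subclasses such as np.matrix through as-is,
    # whereas asarray would convert them to a base ndarray.
    a = np.asanyarray(a)
    return np.exp(np.log(a).mean(axis=axis))

m = np.matrix([[1.0, 4.0], [4.0, 16.0]])
result = gmean(m)
print(type(result).__name__, result)

# A plain nested list still gives a plain ndarray.
plain = gmean([[1.0, 4.0], [4.0, 16.0]])
print(type(plain).__name__, plain)
```

The subclass survives because every step (log, mean, exp) is itself subclass-preserving.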
>>> * A third would be to port the remaining routines of mstats.extras to
>>> stats or morestats (Harrell-Davis quantiles could be implemented more
>>> efficiently in cython, for example).
>>> At each step, we could add a deprecation warning to a reviewed mstats
>>> function and call the corresponding stats function instead.
>> Unfortunately there is not a one to one matching between the stats
>> and mstats functions.
> Mmh, if we proceed methodically, that shouldn't be too much of a
> problem. Name differences can be easily addressed. Behavior differences
> are trickier, but may be just bugs waiting for us.
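The deprecate-and-forward step, including a name difference, might be sketched like this. The helper `deprecated_alias` and both function names are hypothetical stand-ins, not anything that exists in scipy.

```python
import functools
import warnings

def deprecated_alias(new_func, old_name):
    """Build a wrapper that warns under the old name and forwards the call."""
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)
    wrapper.__name__ = old_name
    return wrapper

def new_mean(values):
    """Stand-in for a reviewed stats function (hypothetical)."""
    return sum(values) / len(values)

# Stand-in for the old mstats entry point, under its old name.
old_mean = deprecated_alias(new_mean, "old_mean")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = old_mean([1.0, 2.0, 3.0])

print(value, [str(w.message) for w in caught])
```

This only covers name differences; behavior differences would still need to be reviewed function by function.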
>> When I started I found 178 functions between the different modules
>> including some that are or should be deprecated. Only about 40
>> functions (plus a few that should be removed) have the same
>> name in the stats and mstats_basic files. I have not checked these
>> to know if these have the exact same behavior as expected by the
>> input type. There are others that perhaps only differ in name.
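A count like the one above can be reproduced with a few lines of introspection. The sketch below uses the standard-library modules math and cmath purely as stand-ins so that it runs anywhere; the assumption is that one would pass the stats and mstats modules instead. The helper names are mine.

```python
import cmath
import math

def public_callables(module):
    """Names of public callables defined in a module's namespace."""
    return {
        name
        for name in dir(module)
        if not name.startswith("_") and callable(getattr(module, name))
    }

def compare_namespaces(mod_a, mod_b):
    """Split public callable names into shared and module-specific sets."""
    a, b = public_callables(mod_a), public_callables(mod_b)
    return {
        "both": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }

report = compare_namespaces(math, cmath)
for key, names in report.items():
    print(key, len(names))
```

Matching names say nothing about matching arguments or behavior, so this only gives the starting list for a manual review.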
>>> What would be a good time line ? 0.8.0, or is it too late? 0.9.0 ?
>> For 0.8 I think we must at least warn users that changes are coming
>> for stats and mstats, as well as make sure that any unnecessary
>> functions are deprecated. Also we could start the process to
>> reorganize the stats functions and combine the stats and mstats
>> functions with the same name and behavior.
> When is 0.8.0 supposed to be released ? If it's a matter of just a
> couple of weeks, we can sit on the issue as long as needed. If it's
> longer than that, we should probably get started now.
While I cannot help immediately with this, I had submitted patches for
some of these. So hopefully the following will help.
These functions just rename existing functions and perhaps the renaming,
as necessary, should be elsewhere (like the distributions):
These functions are/should be deprecated
I thought that these could be replaced by a one-liner using the compress
method because these only work for 1d arrays; i.e., for some cutoff values
minval and maxval:
tmean  a.compress((a>minval) & (a<maxval)).mean()
tmin   a.compress((a>minval) & (a<maxval)).min()
tsem   a.compress((a>minval) & (a<maxval)).std() with df=n
tstd   a.compress((a>minval) & (a<maxval)).std() with df=n-1
tvar   a.compress((a>minval) & (a<maxval)).var()
Actually these probably should be deprecated in favor of the mstats
approach for trimmed_mean etc., which have an axis keyword indicating
support for multiple dimensions.
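To make the contrast concrete, here is a sketch of the 1d compress one-liner next to a masked-array version that takes an axis. Both functions are illustrative names of mine, not the scipy implementations, and both use strict inequalities at the cutoffs as in the one-liners above.

```python
import numpy as np

def tmean_1d(a, minval, maxval):
    """Trimmed mean of a 1d array using compress."""
    a = np.asarray(a, dtype=float).ravel()
    return a.compress((a > minval) & (a < maxval)).mean()

def tmean_axis(a, minval, maxval, axis=None):
    """Trimmed mean along an axis by masking values outside the cutoffs."""
    a = np.asarray(a, dtype=float)
    keep = (a > minval) & (a < maxval)
    # Mask everything outside (minval, maxval); masked entries are skipped.
    return np.ma.array(a, mask=~keep).mean(axis=axis)

flat = tmean_1d([1, 2, 3, 4, 5, 100], 1.5, 10)
rows = tmean_axis([[1.0, 2.0, 3.0], [4.0, 5.0, 100.0]], 0, 10, axis=1)
print(flat)
print(rows)
```

The compress version has to flatten its input, while the masked version keeps the array shape and so can reduce along any axis.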
Below is a list I compiled of the different functions that have the
same name in both stats and mstats (really mstats_basic). For the most
part these have the same arguments but not always. Also some are or
should be deprecated or are unnecessary.