[SciPy-dev] PEP: Improving the basic statistical functions in Scipy
Pierre GM
pgmdevlist@gmail....
Fri Feb 27 12:54:10 CST 2009
All,
I followed the thread without actively participating, but as the
author of stats.mstats, I feel compelled to jump in.
When I started working on some masked versions of scipy.stats version,
numpy.ma wasn't part of numpy per se (if I remember correctly), and
the package hadn't been thouroughly checked. Modifying scipy.stats to
recognize masked arrays (for example, by changing _chk_array) wasn't
really an option at the time, because numpy.ma was still considered as
experimental. The easiest was therefore just to duplicate the
functions. I checked that the results were consistent with the non-
masked versions at the time, checked against R also, so I was fairly
confident in the results. However, I didn't strive for exhaustivity: I
coded the functions I needed and some of them direct relatives, but
never tried to expand some more complex functions.
I'm all in favor for merging the masked and non-masked versions:
that's cleaner, easier to debug and maintain should there be some
changes in signature (or even just doc). There's a few aspects we must
keep in mind however:
* standard numpy functions usually work well with masked arrays: if
the input is MA, the np function should call MA.__array_wrap__ which
will transform the result back to a MA. I have the nagging feeling
it's not completely fool-proof, however, but the numpy.ma functions
should always work. if the input is not a MA, then the output will
never be masked, which may be a problem.
Consider this example:
>>> x=np.array([0,1,2])
>>> np.log(x).mean()
-inf
>>> ma.log(x).mean()
0.34657359027997264
>>> np.log(ma.array(x)).mean()
0.34657359027997264
If we don't transform x to a MA, or don't use the numpy.ma function,
we just get a NaN/Inf as results. Otherwise, we get a nice float.
* Systematically using MA may be problematic: I can picture cases
where a standard ndarray is expected as output when a standard ndarray
is given as inputt. If we force the conversion, the result will be a
MA. Should we convert it back to a ndarray ? Using .filled() ? But
then, with which filling_value ?
In that case, we may want to consider a "usemask" flag: if
usemask=True and the input is a ndarray, then the output will be a MA,
otherwise it'll be a ndarray. Using a MA as input would set usemask to
True no matter what.
* The correlation functions discard masked values pair-wise: we can
pre-process the inputs and still use the standard functions, so no
problem here.
* Some functions (threshold) can work directly w/ MA.
* Some functions (the ones based on ranking) should behave differently
whether the input has masked values (as missing values must be taken
as ties).
About optimization and speed test:
* There's definitely some room for improvement here: for example,
instead of using the count method, we could use the count function to
prevent any unnecessary conversion to MA. (I'd need to optimize the
count function, but that should be easy...). That'll depend on what we
decide for handling MA.
* Just running tests w/ the masked versions of the function will
always show that they are slower, of course.
* Slight differences of the order of 1e-15 should not really matter.
All, don't hesitate to contact me on or off-list if you have some
specific questions about implementation details.
More information about the Scipy-dev
mailing list