[SciPy-dev] PEP: Improving the basic statistical functions in Scipy

Pierre GM pgmdevlist@gmail....
Fri Feb 27 12:54:10 CST 2009


All,
I followed the thread without actively participating, but as the  
author of stats.mstats, I feel compelled to jump in.

When I started working on some masked versions of scipy.stats version,  
numpy.ma wasn't part of numpy per se (if I remember correctly), and  
the package hadn't been thouroughly checked. Modifying scipy.stats to  
recognize masked arrays (for example, by changing _chk_array) wasn't  
really an option at the time, because numpy.ma was still considered as  
experimental. The easiest was therefore just to duplicate the  
functions. I checked that the results were consistent with the non- 
masked versions at the time, checked against R also, so I was fairly  
confident in the results. However, I didn't strive for exhaustivity: I  
coded the functions I needed and some of them direct relatives, but  
never tried to expand some more complex functions.

I'm all in favor for merging the masked and non-masked versions:  
that's cleaner, easier to debug and maintain should there be some  
changes in signature (or even just doc). There's a few aspects we must  
keep in mind however:

* standard numpy functions usually work well with masked arrays: if  
the input is MA, the np function should call MA.__array_wrap__ which  
will transform the result back to a MA. I have the nagging feeling  
it's not completely fool-proof, however, but the numpy.ma functions  
should always work. if the input is not a MA, then the output will  
never be masked, which may be a problem.

Consider this example:
 >>> x=np.array([0,1,2])
 >>> np.log(x).mean()
-inf
 >>> ma.log(x).mean()
  0.34657359027997264
 >>> np.log(ma.array(x)).mean()
  0.34657359027997264

If we don't transform x to a MA, or don't use the numpy.ma function,  
we just get a NaN/Inf as results. Otherwise, we get a nice float.

* Systematically using MA may be problematic: I can picture cases  
where a standard ndarray is expected as output when a standard ndarray  
is given as inputt. If we force the conversion, the result will be a  
MA. Should we convert it back to a ndarray ? Using .filled() ? But  
then, with which filling_value ?

In that case, we may want to consider a "usemask" flag: if  
usemask=True and the input is a ndarray, then the output will be a MA,  
otherwise it'll be a ndarray. Using a MA as input would set usemask to  
True no matter what.

* The correlation functions discard masked values pair-wise: we can  
pre-process the inputs and still use the standard functions, so no  
problem here.

* Some functions (threshold) can work directly w/ MA.

* Some functions (the ones based on ranking) should behave differently  
whether the input has masked values (as missing values must be taken  
as ties).


About optimization and speed test:
* There's definitely some room for improvement here: for example,  
instead of using the count method, we could use the count function to  
prevent any unnecessary conversion to MA.  (I'd need to optimize the  
count function, but that should be easy...). That'll depend on what we  
decide for handling MA.
* Just running tests w/ the masked versions of the function will  
always show that they are slower, of course.
*  Slight differences of the order of 1e-15 should not really matter.

All, don't hesitate to contact me on or off-list if you have some  
specific questions about implementation details.





More information about the Scipy-dev mailing list