[SciPy-dev] PEP: Improving the basic statistical functions in Scipy

Pierre GM pgmdevlist@gmail....
Fri Feb 27 14:05:18 CST 2009

> Given that gmean is a very simple function, I was pretty surprised
> about the difference in timing. Now, I think that the main slowdown is
> that the mask has to be checked in every operation that calls a ma.*
> version of a function.

It's actually a tad more complex:
ma.log checks the mask of the input, but also converts the output to a  
MA when needed, with all the overhead of MA.__array_finalize__.
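
A small sketch of the conversion described above, assuming a reasonably recent NumPy. The array values are made up for illustration, and nothing here measures the overhead itself:

```python
import numpy as np
import numpy.ma as ma

plain = np.array([1.0, 10.0, 100.0])  # illustrative data, no mask at all

out_np = np.log(plain)  # stays a plain ndarray
out_ma = ma.log(plain)  # comes back wrapped as a MaskedArray

# Same numbers, different container: the ma.* version builds a mask
# and a MaskedArray view even though nothing was masked on input.
print(type(out_np).__name__, type(out_ma).__name__)
```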

> As we discussed for the OLS case for larger statistical functions,
> building the main workload with plain arrays will save a lot of
> overhead. This works for cases where a single compression or fill is
> correct for all required numerical operations.

That's indeed the way to go: preprocess a MA to transform it into an
ndarray (by dropping masked values, or processing them afterwards),
perform the operation, then revert to a MA if needed.
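
As a rough illustration of that pattern for gmean, here is a minimal sketch restricted to 1-D input. The function name gmean_via_ndarray is hypothetical, not an existing scipy function:

```python
import numpy as np
import numpy.ma as ma

def gmean_via_ndarray(a):
    """Geometric mean of 1-D input, ignoring masked entries (sketch)."""
    # Preprocess: drop masked values once, which yields a plain ndarray.
    data = ma.compressed(ma.asarray(a))
    if data.size == 0:
        # Nothing left to average: revert to the masked constant.
        return ma.masked
    # Main workload on a plain ndarray, with no mask bookkeeping.
    return np.exp(np.log(data).mean())

x = ma.array([1.0, 4.0, 100.0], mask=[False, False, True])
result = gmean_via_ndarray(x)  # uses only 1.0 and 4.0
```

A single compression is enough here because every step (log, mean, exp) wants the same set of valid values.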

> One more issue is the treatment of nan and masked values, for example,
> if a function produces nans because of a zero division, then I would
> want to treat it differently than a missing value in the data. If it is
> automatically included in the mask then this distinction is lost. Or is
> there a different use case for this?

Nope. If a value gets masked by an operation, you won't be able to
track it (other than by comparing the mask of the output w/ the mask of
the input).
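
For what it's worth, the mask comparison could look something like this. It is only a sketch, and the data and variable names are made up:

```python
import numpy as np
import numpy.ma as ma

# The last entry is missing in the data; -1.0 is valid data, but it is
# outside the domain of log.
x = ma.array([1.0, -1.0, 4.0, 9.0], mask=[False, False, False, True])
y = ma.log(x)

# getmaskarray always returns a full boolean array, even for nomask.
mask_in = ma.getmaskarray(x)
mask_out = ma.getmaskarray(y)

# Entries masked by the operation itself, not by the original data.
newly_masked = mask_out & ~mask_in
print(newly_masked)
```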

> In your log example, I wouldn't want to get a nice number back. I want
> the function to complain.

Because you work w/ ndarrays. If I work w/ MA, I expect it not to  
crash but drop the masked values.
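
To make the two expectations concrete, here is a sketch with made-up values. np.errstate is one way to get the ndarray side to complain; by default NumPy would only warn and return nan:

```python
import numpy as np
import numpy.ma as ma

values = np.array([1.0, -1.0, np.e])

# Plain ndarray: ask NumPy to raise instead of quietly returning nan.
try:
    with np.errstate(invalid="raise"):
        np.log(values)
    ndarray_complained = False
except FloatingPointError:
    ndarray_complained = True

# Masked version: the invalid entry is masked, and mean() skips it.
masked_mean = ma.log(values).mean()
```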

> But, before we start to rewrite and refactor across the board, I still
> want to finish cleaning up the existing functions and resolve some of
> the current inconsistencies.

Well, you may double the workload. One way would be to first agree on  
how we should refactor/reorganize the functions, then clean the  
ndarray part of the function. We can always add a NotImplementedError  
if the input is a MA w/ missing values.
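
A sketch of what such a guard might look like. The name gmean_ndarray_only is hypothetical, and the real functions would need to agree on where this check lives:

```python
import numpy as np
import numpy.ma as ma

def gmean_ndarray_only(a):
    """Geometric mean for plain data; refuses masked input (sketch)."""
    # Bail out early rather than silently mishandling missing values.
    if ma.isMaskedArray(a) and ma.is_masked(a):
        raise NotImplementedError(
            "masked input with missing values is not supported yet")
    # A MA without masked entries is safe to treat as a plain ndarray.
    data = np.asarray(a, dtype=float)
    return np.exp(np.log(data).mean())

try:
    gmean_ndarray_only(ma.array([2.0, 8.0], mask=[False, True]))
    guard_raised = False
except NotImplementedError:
    guard_raised = True
```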
