[SciPy-dev] PEP: Improving the basic statistical functions in Scipy
Pierre GM
pgmdevlist@gmail....
Fri Feb 27 14:05:18 CST 2009
>
> Given that gmean is a very simple function, I was pretty surprised about
> the difference in timing. Now, I think that the main slowdown is that the
> mask has to be checked in every operation that calls a ma.* version of a
> function.
It's actually a tad more complex:
ma.log checks the mask of the input, but also converts the output to a
MA when needed, with all the overhead of MA.__array_finalize__.
> As we discussed for the OLS case for larger statistical functions, building
> the main workload with plain arrays will save a lot of overhead. This works
> for cases where a single compression or fill is correct for all required
> numerical operations.
That's indeed the way to go: preprocess a MA to transform it into an
ndarray (by dropping masked values, or processing them afterwards),
perform the operation, then convert back to a MA if needed.
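
A minimal sketch of that preprocess / compute / revert pattern, using gmean
as the example. The function name and signature here are illustrative
assumptions, not the scipy.stats implementation. Masked entries are filled
with 1.0 so that they contribute log(1) = 0 to the sum, the work is done on a
plain ndarray, and the result is wrapped in a MA only at the end.

```python
import numpy as np
import numpy.ma as ma

def gmean_sketch(a, axis=0):
    """Geometric mean along `axis`, ignoring masked values (illustrative only)."""
    a = ma.asarray(a)
    mask = ma.getmaskarray(a)           # full boolean mask, even if nothing is masked
    data = a.filled(1.0)                # plain ndarray; log(1) = 0 for masked slots
    count = (~mask).sum(axis=axis)      # number of valid entries per slice
    # Main workload on plain ndarrays: no per-call mask handling.
    log_sum = np.log(data).sum(axis=axis)
    out = np.exp(log_sum / np.maximum(count, 1))
    # Revert to a MA, masking slices that had no valid data at all.
    return ma.array(out, mask=(count == 0))

x = ma.array([[1.0, 2.0], [4.0, 8.0], [16.0, -1.0]],
             mask=[[False, False], [False, False], [False, True]])
result = gmean_sketch(x, axis=0)
```

Here the mask is consulted once, up front, instead of inside every ma.* call.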
> One more issue is the treatment of nan and masked values, for example, if a
> function produces nans because of a zero division, then I would want to treat
> it differently than a missing value in the data. If it is automatically
> included in the mask then this distinction is lost. Or is there a different
> use case for this?
Nope. If a value gets masked by an operation, you won't be able to
track it (except by comparing the mask of the output w/ the mask of
the input).
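
A small sketch of that mask comparison, assuming ma.log as the operation.
The variable names are illustrative. Entries that are masked in the output
but not in the input are the ones the operation itself masked (here, values
outside the domain of log).

```python
import numpy as np
import numpy.ma as ma

# The last entry is missing in the data; -1.0 and 0.0 are valid inputs
# that fall outside the domain of log.
x = ma.array([1.0, -1.0, 0.0, 5.0], mask=[False, False, False, True])
y = ma.log(x)

mask_in = ma.getmaskarray(x)    # what was missing before the operation
mask_out = ma.getmaskarray(y)   # what is masked after the operation

# Masked by the operation itself, not by the original data.
newly_masked = mask_out & ~mask_in
```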
> In your log example, I wouldn't want to get a nice number back. I want the
> function to complain.
Because you work w/ ndarrays. If I work w/ MA, I expect it not to
crash but drop the masked values.
> But, before we start to rewrite and refactor across the board, I still want
> to finish cleaning up the existing functions and resolve some of the current
> inconsistencies.
Well, you may double the workload. One way would be to first agree on
how we should refactor/reorganize the functions, then clean the
ndarray part of the function. We can always add a NotImplementedError
if the input is a MA w/ missing values.
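
One way such a guard could look, as a sketch only. The helper name
require_unmasked is hypothetical and not an existing scipy or numpy function.
A MA without any masked value is passed through as a plain ndarray; a MA with
missing values is rejected until the masked code path is written.

```python
import numpy as np
import numpy.ma as ma

def require_unmasked(a):
    """Return `a` as a plain ndarray, refusing MAs that have masked values.

    Hypothetical helper, for illustration only.
    """
    if ma.is_masked(a):
        raise NotImplementedError(
            "masked arrays with missing values are not supported yet")
    if isinstance(a, ma.MaskedArray):
        return a.data           # nothing is masked, so the raw data is safe to use
    return np.asarray(a)

clean = require_unmasked(ma.array([1.0, 2.0, 3.0]))
```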