[SciPy-dev] PEP: Improving the basic statistical functions in Scipy

Bruce Southey bsouthey@gmail....
Fri Feb 27 21:04:42 CST 2009

On Fri, Feb 27, 2009 at 5:13 PM,  <josef.pktd@gmail.com> wrote:
> On Fri, Feb 27, 2009 at 5:47 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
>> On Feb 27, 2009, at 4:52 PM, josef.pktd@gmail.com wrote:
>>> For most of the current statistical functions, with the exception of
>>> different tie handling, I think that we can expand the _chk_asarray to
>>> do the necessary preprocessing.
>> Mmh. _chk_asarray will always return a MA. Is it what you want? Are you
> No, what I meant was, that _chk_asarray is currently called for
> preprocessing in most functions, so it will be easy to use a replacement
> function to obtain the preprocessed (e.g. compressed) data, and whatever
> flags (usemask) we need, in the main body of the function and for the
> decision about the return type.
I really do not see the need for _chk_asarray at all. When a
user passes an ordinary array or a masked array, no further
processing should be required. Also, _chk_asarray calls ravel() when
axis is None, but my understanding is that many numpy functions already
operate over the flattened array when no axis is given.
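
A small check of that understanding, using mean as the example (the
array values are made up for illustration; other reductions may or may
not behave the same way):

```python
import numpy as np
import numpy.ma as ma

# Illustrative 2-D data; the values are arbitrary.
a = np.arange(6.0).reshape(2, 3)
m = ma.masked_greater(a, 4)  # masks the single element 5.0

# With the default axis=None, mean reduces over every element, so an
# explicit ravel() beforehand does not change the result.
flat_mean = np.mean(a)
raveled_mean = np.mean(a.ravel())

# The same holds for the masked array method.
masked_mean = m.mean()
masked_raveled_mean = m.ravel().mean()
```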

The only case that needs addressing is when a user supplies an object
that is not an array: it should be converted to an array if possible,
and otherwise an error needs to be raised. After conversion to an array
no further processing is required, and in some cases even that
conversion will already be done within the existing functions.
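
Something like the following is all I have in mind. This is only a
sketch: the helper name as_numeric_array is hypothetical, and rejecting
every non-numeric dtype (including bool) is my assumption about what
"an error needs to be raised" should mean.

```python
import numpy as np
import numpy.ma as ma


def as_numeric_array(a):
    """Hypothetical helper: convert array-like input, else raise."""
    # asanyarray passes ndarray subclasses (e.g. masked arrays) through
    # unchanged and converts lists, tuples, etc. to an ndarray.
    arr = np.asanyarray(a)
    # Assumption: anything without a numeric dtype is treated as an error.
    if not np.issubdtype(arr.dtype, np.number):
        raise TypeError("cannot convert input to a numeric array")
    return arr


from_list = as_numeric_array([1, 2, 3])
m = ma.array([1.0, 2.0], mask=[False, True])
from_masked = as_numeric_array(m)
```

Note that the masked array comes back untouched, mask included, so no
separate preprocessing step is needed for it.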

>> An idea is then to use the 'usemask' parameter I was talking about
>> earlier:
>> * if usemask is False (default), return a ndarray
>> * If usemask is True, return a MA
>> * if the input is a MA (w/ or w/o missing values), set usemask to
>> True, and mask the NaNs/Infs first w/ ma.fix_invalid.
>> That way, we need only one function. If we really need it, we can have
>> duplicate functions in scipy.mstats where usemask is set to True by
>> default.
>> Now, for the actual implementation:
>> * usemask=False and some NaNs: return NaN
>> * usemask=True: use the ma implementation.
> That clarifies the API. I will try to write a prototype, but I spend
> too much time on scipy this week.
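
To make sure I am reading the 'usemask' proposal quoted above correctly,
here is a minimal sketch of it for a mean. The function name
mean_usemask is hypothetical, and this is only my interpretation of the
bullet points, not an agreed implementation:

```python
import numpy as np
import numpy.ma as ma


def mean_usemask(a, axis=None, usemask=False):
    """Hypothetical sketch of the proposed 'usemask' behaviour."""
    # A masked array as input switches usemask on, per the proposal.
    if isinstance(a, ma.MaskedArray):
        usemask = True
    if usemask:
        # fix_invalid masks NaNs/Infs; then use the ma implementation.
        return ma.fix_invalid(a).mean(axis=axis)
    # usemask=False: plain ndarray path, so any NaN propagates.
    return np.mean(np.asarray(a), axis=axis)


plain_result = mean_usemask([1.0, np.nan, 3.0])
masked_result = mean_usemask([1.0, np.nan, 3.0], usemask=True)
from_ma_result = mean_usemask(ma.array([1.0, 2.0, np.inf]))
```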

This usemask approach is a little messy, and it has been discussed
elsewhere. There are two distinct issues here:
1) If the array contains non-finite numbers (NaN, positive or
negative infinity), then perhaps the user can strip these out first.
For example, R's mean function has the argument 'na.rm = FALSE'.
2) If non-finite elements arise during the function, such as from
taking the log of zero, then I think the user must be told that these
have occurred rather than be forced to check the mask, especially if
they have already masked values for other reasons such as incomplete
data.
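
A rough illustration of both points. The helper mean_finite and its
na_rm argument are hypothetical names borrowed from R for the sake of
the example; note that this sketch drops infinities as well, whereas
R's na.rm only removes NA/NaN.

```python
import numpy as np
import numpy.ma as ma


def mean_finite(a, na_rm=False):
    """Hypothetical helper: optionally strip non-finite values first."""
    a = np.asarray(a, dtype=float)
    if na_rm:
        # Assumption: drop NaN and +/-inf (broader than R's na.rm).
        a = a[np.isfinite(a)]
    return a.mean()


# Issue 1: non-finite values already present in the input.
kept = mean_finite([1.0, np.nan, 3.0])
stripped = mean_finite([1.0, np.nan, 3.0], na_rm=True)

# Issue 2: non-finite values created inside the computation.
x = np.array([1.0, 0.0, 2.0])
with np.errstate(divide="ignore"):
    plain_log = np.log(x)  # the zero becomes -inf, visible in the result
masked_log = ma.log(x)     # the zero is silently masked instead
```

With the plain ndarray the -inf is there for the user to see, while the
ma version hides it behind the mask, where it cannot be told apart from
values the user masked deliberately.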

