[SciPy-dev] PEP: Improving the basic statistical functions in Scipy

josef.pktd@gmai...
Fri Feb 27 21:52:55 CST 2009


On Fri, Feb 27, 2009 at 10:04 PM, Bruce Southey <bsouthey@gmail.com> wrote:
> On Fri, Feb 27, 2009 at 5:13 PM,  <josef.pktd@gmail.com> wrote:
>> On Fri, Feb 27, 2009 at 5:47 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
>>>
>>> On Feb 27, 2009, at 4:52 PM, josef.pktd@gmail.com wrote:
>>>>
>>>> For most of the current statistical functions, with the exception of
>>>> different tie handling, I think that we can expand the _chk_asarray to
>>>> do the necessary preprocessing.
>>>
>>> Mmh. _chk_asarray will always return a MA. Is it what you want? Are you
>>>
>> No, what I meant was that _chk_asarray is currently called for
>> preprocessing in most functions, so it will be easy to use a replacement
>> function to obtain the preprocessed (e.g. compressed) data, and whatever
>> flags (usemask) we need, in the main body of the function and for the
>> decision about the return type.
>>
> I really do not see the requirement for _chk_asarray at all. When a
> user passes a typical array or masked array then there should be no
> further processing required. Also, _chk_asarray will use ravel() if
> axis is None, but my understanding is that many numpy functions operate
> over a flattened array when no axis is defined.
>
> The only case that needs addressing is when a user supplies an object
> that can be converted to an array; otherwise an error needs to be
> raised. After conversion to an array no further processing is required,
> and even that conversion will in some cases be done within the
> existing functions.

The current usage allows passing lists instead of arrays.
This is very convenient for interactive use but might also have other
uses, e.g. when building a list incrementally. And I thought asarray
has little cost if the input is already an array.

I didn't look systematically at ravel, but while axis=None works
automatically for many numpy functions, more complex statistical
functions need more control over the dimension of the input arrays.
Many statistical functions are designed only for 1d or 2d input, and
controlling the dimension at the beginning simplifies the main part of
the function. I had some cases where I struggled for a while with the
dimensions and axis handling, although in many cases the extra control
might be redundant.

If we want to handle different array types with the same function, then
the _chk_asarray call will be replaced by type-specific preprocessing.

>
>
>>
>>> An idea is then to use the 'usemask' parameter I was talking about
>>> earlier:
>>> * if usemask is False (default), return a ndarray
>>> * If usemask is True, return a MA
>>> * if the input is a MA (w/ or w/o missing values), set usemask to
>>> True, and mask the NaNs/Infs first w/ ma.fix_invalid.
>>>
>>> That way, we need only one function. If we really need it, we can have
>>> duplicate functions in scipy.mstats where usemask is set to True by
>>> default.
>>>
>>> Now, for the actual implementation:
>>> * usemask=False and some NaNs: return NaN
>>> * usemask=True: use the ma implementation.
>>>
>>
>> That clarifies the API. I will try to write a prototype, but I have
>> spent too much time on scipy this week.
>
> This is a little messy and there has been discussion regarding this
> elsewhere. In these terms there are two distinct issues:
> 1) If the array contains non-finite numbers (NaN, positive and
> negative infinity) then perhaps the user can strip these out first for
> example R's mean function has the argument 'na.rm = FALSE'.

If the merged functions are able to handle masked arrays and plain
ndarrays, then we can also offer the user an option for the treatment
of nans; this would make the separate nanmean, ... obsolete. Operations
on inf might be too ambiguous, and I would think they are the
responsibility of the user. And if there is an inf*0, then I want to
give the nan back, and the user can decide what to do. In general,
inf is a legitimate number and should propagate correctly (if
the user wants to leave it in), e.g.
>>> stats.norm.cdf(-np.inf)
0.0
>>> stats.norm.cdf(np.inf)
1.0
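The nan-stripping option mentioned above, analogous to R's na.rm
argument, could look something like this; the function name and the
omitnan flag are made-up placeholders, not a proposed scipy API:

```python
import numpy as np

def mean_with_nan_option(a, omitnan=False):
    """Hypothetical mean whose NaN treatment is a keyword option, so a
    separate nanmean would be unnecessary.  inf is deliberately left
    alone and propagates through the computation."""
    a = np.asarray(a, dtype=float)
    if omitnan:
        a = a[~np.isnan(a)]        # strip NaNs, like R's na.rm=TRUE
    return a.mean()
```

With omitnan=False (the default) a NaN in the input propagates to the
result, matching plain ndarray behavior.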


> 2) If non-finite elements arise during the function like taking the
> log of zero then I think that the user must know these have occurred
> rather than be forced to check the mask - especially if they have
> already masked values for other reasons like incomplete data.

I agree, and I want this behavior for ndarrays; for masked arrays I'm
less involved since I'm not using them (yet).
I like Eric's use of a three-valued choice with "auto", which adds one
option for the user:

masked='auto' : True|False|'auto' determines the output;
   if True, output will be a masked array;
   if False, output will be an ndarray with nan used as a bad flag if necessary;
   if 'auto', output will match input

What the exact definition of "auto" is for masked arrays is up to the
masked array users.

Josef

