[SciPy-dev] scipy.stats._chk_asarray

Bruce Southey bsouthey@gmail....
Wed Jun 3 10:42:53 CDT 2009

josef.pktd@gmail.com wrote:
> On Wed, Jun 3, 2009 at 10:05 AM, Bruce Southey <bsouthey@gmail.com> wrote:
>> josef.pktd@gmail.com wrote:
>>> On Wed, Jun 3, 2009 at 12:55 AM, Robert Kern <robert.kern@gmail.com> wrote:
>>>> On Tue, Jun 2, 2009 at 23:50, Pierre GM <pgmdevlist@gmail.com> wrote:
>>>>> On Jun 2, 2009, at 11:09 PM, josef.pktd@gmail.com wrote:
>>>>>> I tried to see if I can introduce a second version _check_asanyarray,
>>>>>> that doesn't convert to basic np.array, but I didn't get very far.
>>>>>> nanmedian, and nanstd are not easy to convert to work with matrices,
>>>>>> nanstd uses multiplication and nanmedian uses np.compress
>>>>> Well, what about that:
>>>>> * convert the inputs to ndarray w/ _chk_asarray
>>>>> * compute as usual
>>>>> * return a view of the result using the type of the input (using the
>>>>> type keyword of view)
>>>>> That should work w/ nanmedian. There might be some adjustment to make
>>>>> for nanstd (pb of dimensions?)
>>>> That is what I was suggesting, only in decorator form so it could be
>>>> applied everywhere. It's not worth wasting time making a small handful
>>>> of functions work and be inconsistent with all of the others.
>>> If someone gives me this decorator, I will use it, but I don't know
>>> how to write a decorator that works for all input and output cases,
>>> and doesn't screw up our documentation system.
>>> But I can change 2 lines per function, and I know I still have the
>>> same signature and docstring. It looks like it will work for all
>>> descriptive statistics and data transformation in scipy.stats. It
>>> won't be relevant for most of the remainder.
>>> Josef
>> Hi,
>> Using stats._chk_asarray should be completely unnecessary because most
>> of numpy functions accept array-like inputs and use flattened arrays by
>> default unless the axis keyword is used. That is why I did not use it
>> for the stats.gmean and stats.hmean patches.
>> I am also curious why the nanmean is so involved when I would think
>> that, for some array b and axis, you can just do:
>> numpy.nansum(b,axis=axis)/numpy.sum(numpy.isfinite(b), axis=axis)
> For large, badly scaled arrays this might not be a numerically precise
> way of doing it. But I agree that many functions could be written as
> one liners where the only advantage I see, is that we don't have to
> remember the formula.
It is no worse than the current function, and it uses fewer
operations. If an array is sufficiently large, both functions will
fail once the sum of the elements exceeds the available numerical
precision. Nothing in either function addresses 'badly scaled arrays'.
In either case, higher precision or an alternative algorithm is
necessary.
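
For illustration, here is a minimal sketch of the one-liner quoted above,
wrapped in a function. The name nanmean_oneliner and the sample data are
only for this example, not anything in scipy.stats. Note that
numpy.isfinite also leaves infinities out of the count, which may or may
not be what you want.

```python
import numpy as np

def nanmean_oneliner(b, axis=None):
    """Mean that skips NaNs: NaN-ignoring sum divided by the finite count.

    Hypothetical helper for illustration; it does nothing special for
    badly scaled data or for slices that contain no finite values.
    """
    b = np.asarray(b, dtype=float)
    return np.nansum(b, axis=axis) / np.sum(np.isfinite(b), axis=axis)

# Made-up sample data with one NaN in each row.
b = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan]])

col_means = nanmean_oneliner(b, axis=0)   # mean down each column
row_means = nanmean_oneliner(b, axis=1)   # mean across each row
overall = nanmean_oneliner(b)             # mean of the flattened array
```

With axis=None both numpy calls work on the flattened array, so no
separate flattening step is needed.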

>> Granted nanstd is more complex and, in both cases, these probably should
>> be part of numpy.
> a**2 and a*b have completely different meaning for matrices than for
> ndarrays. Without conversion, writing any more complex statistical
> function would be a major hassle.
Well, as you know, the functions in scipy.stats.py really just accept
plain standard numpy arrays, so any other array type should not work
at all and, technically, should fail immediately as incorrect input.
Instead, I think these functions rely entirely on numpy to do the
expected thing for array types that are not the basic type. Special
functions need to be written to address the quirks of these array
types (like scipy.mstats.py for masked arrays), or new functions that
address all 'supported' array types.
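
As a rough sketch of the convert/compute/view idea quoted earlier in the
thread, something like the following might do for numpy.matrix inputs.
The decorator name preserve_array_type and the function sum_of_squares
are made up for this example, and I am assuming the array is always the
first positional argument. It would not be enough for masked arrays,
since a plain view does not carry the mask.

```python
import functools
import numpy as np

def preserve_array_type(func):
    """Hypothetical decorator: compute on a plain ndarray, then view the
    result as the input's ndarray subclass (e.g. numpy.matrix)."""
    @functools.wraps(func)  # keeps the wrapped function's name and docstring
    def wrapper(a, *args, **kwargs):
        arr = np.asarray(a)               # strip the subclass
        result = func(arr, *args, **kwargs)
        is_subclass_input = (isinstance(a, np.ndarray)
                             and type(a) is not np.ndarray)
        if is_subclass_input and isinstance(result, np.ndarray):
            return result.view(type=type(a))
        return result                     # scalars and plain arrays pass through
    return wrapper

@preserve_array_type
def sum_of_squares(a, axis=0):
    """Sum of squared elements along axis."""
    return np.sum(a * a, axis=axis)       # `*` is elementwise on an ndarray

m = np.matrix([[1.0, 2.0], [3.0, 4.0]])
matrix_square = m * m                         # matrix product for np.matrix
elementwise = np.asarray(m) * np.asarray(m)   # elementwise after conversion
ss = sum_of_squares(m)                        # computed elementwise, returned as matrix
```

The point of the conversion is that the body of the function can use `*`
with its ndarray meaning, whatever type comes in.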

> As I mentioned before, I tried with nanmedian and nanstd and gave up
> very fast, since many functions don't work correctly or have a
> different meaning. Writing code that is not allowed to use `*` looks
> pretty hard to read and to write. I haven't tried what happens if
> someone throws a sparse matrix at the stats functions, but we get
> wrong results using for example np.dot.
> Josef
There should be no expectation for the scipy.stats functions to work at 
all for sparse matrix inputs!

I did not count the wrong result from numpy.dot with sparse matrices
as a problem because it was clearly a user error. I did not think that
scipy.sparse was even a subclass of numpy's arrays, so I was surprised
that the operations actually worked (kudos to the developers!).

