[SciPy-dev] PEP: Improving the basic statistical functions in Scipy

josef.pktd@gmai... josef.pktd@gmai...
Fri Feb 27 12:27:24 CST 2009


On Fri, Feb 27, 2009 at 12:42 PM, Bruce Southey <bsouthey@gmail.com> wrote:
> josef.pktd@gmail.com wrote:
> [snip]
>> What I would like to do, but haven't had the time for yet, is to run the
>> tests for stats.stats on stats.mstats. This way, even if we have some
>> duplicate functions, we would have a cross-check that they are
>> consistent, and it would also be a reminder to fix bugs in the other
>> version.
>>
> Okay, I do not know the right way to use timeit with numpy/scipy, so this
> is not how I would like it to be. But I managed somehow to (unfairly)
> compare the geometric mean function (gmean) using this code:
> import timeit
> # Each Timer gets the statement to time, then a setup string that builds X.
> stand_t = timeit.Timer(
>     'scipy.stats.stats.gmean(X, axis=xs)',
>     'import numpy, scipy.stats.stats; '
>     'X=numpy.random.gamma(shape=2, scale=1, size=(1,10)); xs=None'
> ).timeit(1000)
> masked_t = timeit.Timer(
>     'scipy.stats.mstats.gmean(X, axis=xs)',
>     'import numpy, scipy.stats.mstats; '
>     'X=numpy.random.gamma(shape=2, scale=1, size=(1,10)); xs=None'
> ).timeit(1000)
> numpy_t = timeit.Timer(
>     'numpy.exp((numpy.log(X).mean()))',
>     'import numpy, numpy.random; '
>     'X=numpy.random.gamma(shape=2, scale=1, size=(1,10))'
> ).timeit(1000)
>
> I use Linux and Python 2.5, but my system is very busy, so perhaps it is
> not that fair for benchmarks.
> numpy.__version__  '1.3.0.dev6338'
> scipy.__version__ '0.8.0.dev5597'
>
> There is a cost to using _chk_asarray in this case, which decreases as
> the array size increases. (I am not sure that _chk_asarray is really
> needed anyhow.)
> There is a huge cost to using masked arrays for small sizes, but it
> decreases as the array size increases.
>
> For a 1 by 10 array, the difference between the masked and non-masked
> versions was 0.13 seconds to do it 1000 times, with the ratio of masked
> to non-masked = 7.94
> For a 1 by 10000 array, the difference between the masked and non-masked
> versions was 0.07 seconds to do it 1000 times, with the ratio of masked
> to non-masked = 2.14
>
> However, briefly looking at some of these functions, I think that
> numpy/scipy would naturally handle the array type, as I know that
> numpy.exp((numpy.log(X).mean())) works whether X is the usual array
> or a masked array. If so, then there is no reason for different
> functions unless we need to address masks.
>
>
> Bruce
>
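
For illustration, the point about numpy.exp((numpy.log(X).mean())) working on both array types can be checked with a small sketch like the one below. The helper name gmean_via_log and the seeded generator are assumptions made for this example only; they are not part of scipy.

```python
import numpy as np

def gmean_via_log(x):
    """Geometric mean over all elements, computed as exp(mean(log(x)))."""
    return np.exp(np.log(x).mean())

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
X = rng.gamma(shape=2.0, scale=1.0, size=(1, 10))

# Plain ndarray input.
plain = gmean_via_log(X)

# Same data wrapped as a masked array with nothing masked.
unmasked = gmean_via_log(np.ma.masked_array(X))

# Mask the first element: the mean of the logs then skips it.
mask = np.zeros(X.shape, dtype=bool)
mask[0, 0] = True
masked = gmean_via_log(np.ma.masked_array(X, mask=mask))

# Reference value computed on the remaining nine elements only.
expected = gmean_via_log(X[0, 1:])
```

Here plain and unmasked should agree, and masked should agree with expected, since the mean of a masked array only counts unmasked entries.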

I just ran the stats.stats tests using mstats instead of stats. I
didn't look at the results carefully, but there are some numerical
inconsistencies between the two implementations that need to be
checked.
I attached the test results to http://scipy.org/scipy/scipy/ticket/845.
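
A rough sketch of this kind of cross-check, assuming a recent scipy where the functions are reachable as scipy.stats.<name> and scipy.stats.mstats.<name>. The list of function names, the sample data, and the tolerance are illustrative choices, not the actual test suite attached to the ticket.

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
x = rng.gamma(shape=2.0, scale=1.0, size=50)

# Functions that exist under the same name in both namespaces.
# This selection is an assumption for the example.
names = ["gmean", "hmean", "skew", "kurtosis"]

mismatches = []
for name in names:
    plain = float(getattr(stats, name)(x))
    # Wrap the same data as a masked array with no masked entries.
    masked = float(getattr(mstats, name)(np.ma.masked_array(x)))
    if not np.isclose(plain, masked, rtol=1e-10):
        mismatches.append((name, plain, masked))

print(mismatches)
```

With no entries masked, the two versions should give the same numbers, so any name that ends up in mismatches points at an inconsistency worth checking.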

Your timing numbers don't sound so bad in absolute terms, but if it is
inside an optimization loop, e.g. for maximum likelihood estimation,
then an 8-fold slowdown can get painful. The main problem for the
basic functions, I think, is those functions that need a loop because
the data is not rectangular and cannot use simple broadcasting and
matrix/array operations.
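
To get a feel for the per-call overhead that would add up inside such a loop, a sketch along these lines could be used. It only times the plain numpy expression on a regular array and on a masked array; time_ratio and gmean_via_log are names made up for this example, and the numbers will depend on the machine and the numpy version.

```python
import timeit
import numpy as np

def gmean_via_log(x):
    """Geometric mean over all elements, computed as exp(mean(log(x)))."""
    return np.exp(np.log(x).mean())

def time_ratio(n, repeats=200):
    """Time the plain and masked versions on a 1 by n array.

    Returns (plain seconds, masked seconds, masked / plain).
    """
    rng = np.random.default_rng(0)
    plain = rng.gamma(shape=2.0, scale=1.0, size=(1, n))
    masked = np.ma.masked_array(plain)
    t_plain = timeit.timeit(lambda: gmean_via_log(plain), number=repeats)
    t_masked = timeit.timeit(lambda: gmean_via_log(masked), number=repeats)
    return t_plain, t_masked, t_masked / t_plain

for n in (10, 10000):
    t_plain, t_masked, ratio = time_ratio(n)
    print(f"n={n}: plain={t_plain:.4f}s masked={t_masked:.4f}s ratio={ratio:.2f}")
```

Passing callables to timeit.timeit avoids the setup-string juggling, at the price of a small constant call overhead on both sides of the comparison.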

On the other hand, I don't think that the masked array functions have
been checked for performance ("premature optimization"), since many
of them are still relatively new.

Josef

