[Scipy-tickets] [SciPy] #1389: scipy.stats distributions are slow
SciPy Trac
scipy-tickets@scipy....
Mon Feb 21 05:34:39 CST 2011
#1389: scipy.stats distributions are slow
-------------------------+--------------------------------------------------
Reporter: jpaalasm | Owner: somebody
Type: enhancement | Status: new
Priority: normal | Milestone: 0.10.0
Component: Other | Version: 0.8.0
Keywords: |
-------------------------+--------------------------------------------------
Comment(by josefpktd):
The slow performance is the cost of handling generic distributions. Your
class works well *if* there are no shape parameters, the support is the
real line, and you don't need error checking. Error checking and
broadcasting in the generic distribution are costly.
What I usually do (the first point always, the rest case specific):
* avoid calculations in the loop that can be pushed outside
  - for Monte Carlo, I almost always pregenerate an array of rvs first
    and take the rvs in the loop from this array, which is also faster
    than your class
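{{{
import scipy.stats

# N, loc and scale are assumed to be defined by the surrounding benchmark
def time_vec_rvs():
    # draw all standard normal variates once, outside the loop
    rvs = scipy.stats.norm.rvs(size=N)
    for i in range(N):
        _ = loc + scale*rvs[i]  # shift and rescale one pregenerated draw
}}}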
  - for recursive likelihood functions: calculate loc and scale in the
    loop, calculate the pdf outside
* use numpy.random for standard distributions
* use _pdf instead of pdf *if* I know it works for that distribution or I
  have checked that it works, for example normal, t, and most distributions
  with support on the real line.
  With a finite lower or upper bound (a, b), this might not work, and
  _pdf does not check that the values are valid (e.g. in the support, I
  think).
  Shape parameters can also get tricky and might need the generic checking
  code.
  This is essentially your case.
* inline the _pdf: the normal pdf is so simple and common that we usually
  inline the formula directly. Additionally, inlining is easier in the
  multivariate case anyway.
  I don't think I ever inlined more than a few simple standard
  distributions.
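
A minimal sketch of the first and last points in the list above. The
variance recursion, its parameter values, and all variable names are
illustrative assumptions and do not come from the ticket. The loop only
updates the scale; the normal log-density is inlined and evaluated once on
the whole array afterwards, then compared with the generic method.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
n = 500
y = rng.standard_normal(n)  # pregenerated draws, taken from numpy.random

# Hypothetical parameters of a GARCH(1,1)-style variance recursion.
omega, alpha, beta = 0.1, 0.05, 0.9

# Inside the loop: only the part that is truly recursive (the scale).
sigma2 = np.empty(n)
sigma2[0] = y.var()
for t in range(1, n):
    sigma2[t] = omega + alpha * y[t - 1] ** 2 + beta * sigma2[t - 1]
scale = np.sqrt(sigma2)

# Outside the loop: inlined normal log-density on the full arrays.
z = y / scale
loglike_inline = np.sum(-0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi) - np.log(scale))

# Reference: the generic, checked method, also called once on the arrays.
loglike_generic = np.sum(stats.norm.logpdf(y, loc=0.0, scale=scale))
```

Both values should agree to floating point precision. The inlined version
skips the argument checking and broadcasting, so it is only safe when the
scale is known to be positive.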
Since the entire design of the distributions relies on the generic
framework, I don't see a way to improve this.
There are other cases that can be made faster for specific distributions,
e.g. precalculating some parameters when the generic rvs is used because
numpy doesn't have random numbers for that distribution. But since that is
also distribution specific, it doesn't fit in the current framework.
Maybe it would be interesting to create a distributions_light.
However, your class is useful in special-purpose code when you know it
applies, but it doesn't apply to all distributions. And without error
checking, you don't know whether you get an exception, NaNs, or just wrong
numbers if the assumptions are incorrect.
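
The effect of skipping the checks can be seen with a distribution that has
a bounded support. The sketch below uses the exponential distribution and
sample points chosen only for illustration. Note that _pdf is a private
method, so its behavior is an assumption about the SciPy versions I am
aware of and may change.

```python
import numpy as np
from scipy import stats

# -1.0 lies outside the support [0, inf) of the exponential distribution.
x = np.array([-1.0, 0.5, 2.0])

# Public method: checks the support and returns 0 outside of it.
checked = stats.expon.pdf(x)

# Private method: evaluates the density formula without any checks.
unchecked = stats.expon._pdf(x)

inside = x >= 0  # mask of points that are in the support
```

Inside the support the two results agree. For the point outside the
support, the unchecked call does not return 0 and does not raise, which is
the "just wrong numbers" case.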
--
Ticket URL: <http://projects.scipy.org/scipy/ticket/1389#comment:1>