[Scipy-tickets] [SciPy] #1389: scipy.stats distributions are slow

SciPy Trac scipy-tickets@scipy....
Mon Feb 21 05:34:39 CST 2011


#1389: scipy.stats distributions are slow
-------------------------+--------------------------------------------------
 Reporter:  jpaalasm     |       Owner:  somebody
     Type:  enhancement  |      Status:  new     
 Priority:  normal       |   Milestone:  0.10.0  
Component:  Other        |     Version:  0.8.0   
 Keywords:               |  
-------------------------+--------------------------------------------------

Comment(by josefpktd):

 The slow performance is the cost of handling generic distributions. Your
 class works well *if* there are no shape parameters, the support is the
 real line, and you don't need error checking. Error checking and
 broadcasting in the generic distribution are costly.

 What I usually do (the first point always, the rest depending on the case):

  * avoid calculations in the loop that can be pushed outside
    - for Monte Carlo, I almost always pregenerate an array of rvs first
      and take the rvs in the loop from this array, which is also faster
      than your class
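      {{{
      import scipy.stats

      def time_vec_rvs():
          # N, loc and scale are assumed to be globals of the benchmark script
          rvs = scipy.stats.norm.rvs(size=N)  # one call for all N draws
          for i in xrange(N):  # Python 2; use range on Python 3
              _ = loc + scale*rvs[i]
      }}}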
    - for recursive likelihood functions: calculate loc and scale in the
      loop, then calculate the pdf outside the loop

  * use numpy.random for standard distributions

  * use _pdf instead of pdf *if* I know it works for that distribution or I
    have checked that it works, for example normal, t, and most
    distributions with support on the real line.
    With a finite lower or upper bound (a, b), this might not work, and
    _pdf does not check that the values are valid (e.g. in the support, I
    think).
    Shape parameters can also get tricky and might need the generic
    checking code.
    This is essentially your case.

  * inline the _pdf: the normal pdf is so simple and common that we usually
    inline the formula directly. Inlining is also easier in the
    multivariate case anyway.
    I don't think I have ever inlined more than a few simple standard
    distributions.
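
 The points in the list above can be combined in one small sketch. It is
 only an illustration under stated assumptions: the recursion for loc (an
 exponentially weighted mean with weight alpha), the constant scale, the
 seed, and all variable names are hypothetical and do not come from the
 ticket. The draws are pregenerated with numpy.random, the loop only
 carries the recursion, and the normal log-density is inlined and evaluated
 once outside the loop.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)  # illustrative seed
n = 500
scale = 2.0   # hypothetical constant scale
alpha = 0.8   # hypothetical smoothing weight for the loc recursion

# Pregenerate all standard normal draws in one call instead of one per iteration.
z = rng.standard_normal(n)

# The loop only carries the recursion for loc (and simulates the data here).
x = np.empty(n)
locs = np.empty(n)
loc = 0.0
for t in range(n):
    locs[t] = loc
    x[t] = loc + scale * z[t]
    loc = alpha * loc + (1.0 - alpha) * x[t]

# Outside the loop: inlined normal log-density, evaluated for all points at once.
u = (x - locs) / scale
loglike = np.sum(-0.5 * u * u - np.log(scale) - 0.5 * np.log(2.0 * np.pi))

# Cross-check against the generic (checked, broadcasting) method.
reference = stats.norm.logpdf(x, loc=locs, scale=scale).sum()
```

 The cross-check against scipy.stats.norm.logpdf is worth keeping in a test
 even when the inlined formula is used in the hot path.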

 Since the entire design of the distributions relies on the generic
 framework, I don't see a way to improve this.

 There are other cases that can be made faster for specific distributions,
 e.g. precalculating some parameters when using the generic rvs because
 numpy doesn't have random numbers for that distribution. But since this is
 also distribution specific, it doesn't fit in the current framework.

 Maybe it would be interesting to create a distributions_light.
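
 As a rough idea of what an entry in such a module could look like, here is
 a minimal sketch. The class name NormalLight, its attributes, and the
 random_state argument are all hypothetical; nothing like it exists in
 scipy.stats. It fixes loc and scale at construction, precomputes the
 normalizing constant, and does no error checking or shape handling.

```python
import numpy as np
from scipy import stats


class NormalLight(object):
    """Hypothetical unchecked normal distribution with fixed loc and scale."""

    def __init__(self, loc=0.0, scale=1.0):
        self.loc = loc
        self.scale = scale
        # Precompute the normalizing constant once instead of on every call.
        self._norm_const = 1.0 / (scale * np.sqrt(2.0 * np.pi))

    def pdf(self, x):
        # No validation: assumes scale > 0 and numeric input.
        u = (x - self.loc) / self.scale
        return self._norm_const * np.exp(-0.5 * u * u)

    def rvs(self, size=None, random_state=np.random):
        # Shift and scale standard normal draws from numpy.random.
        return self.loc + self.scale * random_state.standard_normal(size)


dist = NormalLight(loc=1.0, scale=2.0)
x = np.linspace(-3.0, 5.0, 9)
fast = dist.pdf(x)
reference = stats.norm.pdf(x, loc=1.0, scale=2.0)  # generic, checked version
draws = dist.rvs(size=5, random_state=np.random.RandomState(0))
```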

 However, your class is useful in special-purpose code when you know it
 applies, but it doesn't apply to all distributions. And without error
 checking, you don't know whether you get an exception, NaNs, or just wrong
 numbers if the assumptions are incorrect.
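
 The "just wrong numbers" case can be seen with the exponential
 distribution, whose support is bounded below at 0. The sketch below calls
 the private _pdf method directly. Private methods are not part of the
 public API and their behavior may differ between SciPy versions, so the
 comparison is illustrative rather than guaranteed.

```python
import numpy as np
from scipy import stats

x = np.array([-1.0, 0.5, 2.0])   # the first point lies outside the support

# Public method: checks the support and returns 0 for x < 0.
checked = stats.expon.pdf(x)

# Private method: applies the formula exp(-x) without any support check,
# so the out-of-support point silently gets a positive "density".
unchecked = stats.expon._pdf(x)

# For the normal distribution (support on the whole real line) both agree.
norm_checked = stats.norm.pdf(x)
norm_unchecked = stats.norm._pdf(x)
```

 Inside the support the two exponential results coincide. Only the
 out-of-support point differs, and it does so without raising any error.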

-- 
Ticket URL: <http://projects.scipy.org/scipy/ticket/1389#comment:1>
SciPy <http://www.scipy.org>
SciPy is open-source software for mathematics, science, and engineering.

