[Scipy-tickets] [SciPy] #1854: scipy.stats.gaussian_kde.integrate functions are slow

SciPy Trac scipy-tickets@scipy....
Fri Mar 1 02:04:34 CST 2013


#1854: scipy.stats.gaussian_kde.integrate functions are slow
-------------------------------------------------+--------------------------
 Reporter:  itissid                              |       Owner:  pv         
     Type:  enhancement                          |      Status:  new        
 Priority:  normal                               |   Milestone:  Unscheduled
Component:  scipy.special                        |     Version:  0.11.0     
 Keywords:  gaussian_kde, numerical integration  |  
-------------------------------------------------+--------------------------
 Environment: I am using scipy scipy-0.11.0-py2.7 prepackaged binary for my
 MacosX with python 2.7 installed from http://www.python.org/download/.

 I installed scipy through
 http://sourceforge.net/projects/scipy/files/scipy/
 numpy: http://sourceforge.net/projects/numpy/files/NumPy/

 ==========CASE=============

 I created a gaussian kde like so:
  {{{ density = scipy.stats.gaussian_kde(data) }}}

 where data was a one dimensional array of size ~10**6 with float values
 ranging between [0,100].  A few points about the data: On just binning the
 data by value, there were on an average about ~100 values per bin (mean =
 116, standard deviation of 20). Thus there were ~ 10000 bins.

 Now my requirement was to integrate the density using
 density.integrate_box_1d from lower limit: 0 and an upper_limit < 100.
 The upper_limit is an iterable and is monotonically increasing. But the
 integration was really slow.

 So I profiled the integrate_box_id functions and got the following profile
 for 40 iterations in a list comprehension and got this:

 [[BR]]

 Timer unit: 1e-06 s

 File: /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7
 /site-packages/scipy/stats/kde.py
 Function: integrate_box_1d at line 286
 Total time: 5.34844 s
 {{{
 Line #      Hits         Time  Per Hit   % Time  Line Contents
 ==============================================================
    286                                               @profile
    287                                               def
 integrate_box_1d(self, low, high):
    288                                                   """Computes the
 integral of a 1D pdf between two bounds.
    289
    290                                                   Parameters
    291                                                   ----------
    292                                                   low : scalar
    293                                                       Lower bound
 of integration.
    294                                                   high : scalar
    295                                                       Upper bound
 of integration.
    296
    297                                                   Returns
    298                                                   -------
    299                                                   value : scalar
    300                                                       The result of
 the integral.
    301
    302                                                   Raises
    303                                                   ------
    304                                                   ValueError : if
 the KDE is over more than one dimension.
    305
    306                                                   """
    307        40           58      1.4      0.0          if self.d != 1:
    308                                                       raise
 ValueError("integrate_box_1d() only handles 1D pdfs")
    309
    310        40         1190     29.8      0.0          stdev =
 ravel(sqrt(self.covariance))[0]
    311
    312        40       516462  12911.5      9.7          normalized_low =
 ravel((low - self.dataset) / stdev)
    313        40       513140  12828.5      9.6          normalized_high =
 ravel((high - self.dataset) / stdev)
    314
    315        40      2011208  50280.2     37.6          value =
 np.mean(special.ndtr(normalized_high) - \
    316        40      2306273  57656.8     43.1
 special.ndtr(normalized_low))
    317        40          113      2.8      0.0          return value


 [[BR]]

 File: validation.py
 Function: pool_f at line 49
 Total time: 5.47016 s

 [[BR]]

 Line #      Hits         Time  Per Hit   % Time  Line Contents
 ==============================================================
     49                                           @profile
     50                                           def
 pool_f(intensity_function, t):
     51        40      5470161 136754.0    100.0         return
 intensity_function.integrate_box_1d(0, t)
 ============================================
 }}}

 I used kernprof for this. On looking around in the underlying function it
 seems special.ndtr is implemented using  estimates of error functions
 which is really slow. Could this be an issue on macosx? And if not I want
 to make a feature request for a faster error function as an alternative

 NOTE: I cant memoize the values as my limit values are different every
 time so that optimization is out of the question

-- 
Ticket URL: <http://projects.scipy.org/scipy/ticket/1854>
SciPy <http://www.scipy.org>
SciPy is open-source software for mathematics, science, and engineering.


More information about the Scipy-tickets mailing list