[Numpy-discussion] autocorrelation computation performance : use of np.correlate

Ralf Gommers ralf.gommers@googlemail....
Sat Feb 4 16:19:46 CST 2012


On Thu, Feb 2, 2012 at 1:58 AM, <josef.pktd@gmail.com> wrote:

> On Wed, Feb 1, 2012 at 6:48 PM, Benjamin Root <ben.root@ou.edu> wrote:
> >
> >
> > On Wednesday, February 1, 2012, Pierre Haessig <pierre.haessig@crans.org
> >
> > wrote:
> >> Hi,
> >>
> >> [I'm not sure whether this discussion belongs to numpy-discussion or
> >> scipy-dev]
> >>
> >> In day to day time series analysis I regularly need to look at the data
> >> autocorrelation ("acorr" or "acf" depending on the software package).
> >> The straighforward available function I have is matplotlib.pyplot.acorr.
> >> However, for a moderately long time series (say of length 10**5) it
> taking a
> >> huge time just to just dislays the autocorrelation values within a small
> >> range of time lags.
> >> The main reason being it is relying on np.correlate(x,x, mode=2) while
> >> only a few lags are needed.
> >> (I guess mode=2 is an (old fashioned?) way to set mode='full')
> >>
> >> I know that np.correlate performance issue has been discussed already,
> and
> >> there is a *somehow* related ticket
> >> (http://projects.scipy.org/numpy/ticket/1260). I noticed in the
> ticket's
> >> change number 2 the following comment by Josef : "Maybe a truncated
> >> convolution/correlation would be good". I'll come back to this soon.
> >>
> >> I made an example script "acf_timing.py" to start my point with some
> >> timing data :
> >>
> >> In Ipython:
> >>>>> run acf_timing.py # it imports statsmodel's acf + define 2 other acf
> >>>>> implementations + an example data 10**5 samples long
> >>
> >> %time l,c = mpl_acf(a, 10)
> >> CPU times: user 8.69 s, sys: 0.00 s, total: 8.69 s
> >> Wall time: 11.18 s # pretty long...
> >>
> >>  %time c = sm_acf(a, 10)
> >> CPU times: user 8.76 s, sys: 0.01 s, total: 8.78 s
> >> Wall time: 10.79 s # long as well. statsmodel has a similar underlying
> >> implementation
> >> #
> >>
> http://statsmodels.sourceforge.net/generated/scikits.statsmodels.tsa.stattools.acf.html#scikits.statsmodels.tsa.stattools.acf
> >>
> >> #Now, better option : use the fft convolution
> >> %time c=sm_acf(a, 10,fft=True)
> >> CPU times: user 0.03 s, sys: 0.01 s, total: 0.04 s
> >> Wall time: 0.07 s
> >> # Fast, but I'm not sure about the memory implication of using fft
> though.
> >>
> >> #The naive option : just compute the acf lags that are needed
> >> %time l,c = naive_acf(a, 10)
> >> CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
> >> Wall time: 0.01 s
> >> # Iterative computation. Pretty silly but very fast
> >> # (Now of course, this naive implementation won't scale nicely for a lot
> >> of lags)
>
> I don't think it's silly to have a short python loop, statsmodels
> actually uses the loop in the models, for example in yule_walker (and
> GLSAR), because in most statistical application I wouldn't expect a
> large number of lags. The time series models don't use the acov
> directly, but I think most of the time we just loop over the lags.
>
> >>
> >> Now comes (at last) the question : what should be done about this
> >> performance issue ?
> >>  - should there be a truncated np.convolve/np.correlate function, as
> Josef
> >> suggested ?
> >>  - or should people in need of autocorrelation find some workarounds
> >> because this usecase is not big enough to call for a change in
> np.convolve ?
> >>
> >> I really feel this question is about *where* a change should be
> >> implemented  (numpy, scipy.signal, maplotlib ?) so that it makes sense
> while
> >> not breaking 10^10 lines of numpy related code...
> >>
> >> Best,
> >> Pierre
> >>
> >>
> >
> > Speaking for matplotlib, the acorr() (and xcorr()) functions in mpl are
> > merely a convenience.  The proper place for any change would not be mpl
> > (although, we would certainly take advantage of any improved acorr() and
> > xcorr() that are made available in numpy.
>
> I also think that numpy or scipy would be the natural candidates for a
> correlate that works fast for an intermediate number of desired lags
> (but still short compared to length of data).
>
> scipy.signal is the right place I think. numpy shouldn't grow too many
functions like this.

Ralf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/numpy-discussion/attachments/20120204/a40b4b64/attachment-0001.html 


More information about the NumPy-Discussion mailing list