[Numpy-discussion] autocorrelation computation performance : use of np.correlate
Wed Feb 1 18:58:28 CST 2012
On Wed, Feb 1, 2012 at 6:48 PM, Benjamin Root <firstname.lastname@example.org> wrote:
> On Wednesday, February 1, 2012, Pierre Haessig <email@example.com>
>> [I'm not sure whether this discussion belongs to numpy-discussion or
>> In day to day time series analysis I regularly need to look at the data
>> autocorrelation ("acorr" or "acf" depending on the software package).
>> The straighforward available function I have is matplotlib.pyplot.acorr.
>> However, for a moderately long time series (say of length 10**5) it taking a
>> huge time just to just dislays the autocorrelation values within a small
>> range of time lags.
>> The main reason being it is relying on np.correlate(x,x, mode=2) while
>> only a few lags are needed.
>> (I guess mode=2 is an (old fashioned?) way to set mode='full')
>> I know that np.correlate performance issue has been discussed already, and
>> there is a *somehow* related ticket
>> (http://projects.scipy.org/numpy/ticket/1260). I noticed in the ticket's
>> change number 2 the following comment by Josef : "Maybe a truncated
>> convolution/correlation would be good". I'll come back to this soon.
>> I made an example script "acf_timing.py" to start my point with some
>> timing data :
>> In Ipython:
>>>>> run acf_timing.py # it imports statsmodel's acf + define 2 other acf
>>>>> implementations + an example data 10**5 samples long
>> %time l,c = mpl_acf(a, 10)
>> CPU times: user 8.69 s, sys: 0.00 s, total: 8.69 s
>> Wall time: 11.18 s # pretty long...
>> %time c = sm_acf(a, 10)
>> CPU times: user 8.76 s, sys: 0.01 s, total: 8.78 s
>> Wall time: 10.79 s # long as well. statsmodel has a similar underlying
>> #Now, better option : use the fft convolution
>> %time c=sm_acf(a, 10,fft=True)
>> CPU times: user 0.03 s, sys: 0.01 s, total: 0.04 s
>> Wall time: 0.07 s
>> # Fast, but I'm not sure about the memory implication of using fft though.
>> #The naive option : just compute the acf lags that are needed
>> %time l,c = naive_acf(a, 10)
>> CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
>> Wall time: 0.01 s
>> # Iterative computation. Pretty silly but very fast
>> # (Now of course, this naive implementation won't scale nicely for a lot
>> of lags)
I don't think it's silly to have a short python loop, statsmodels
actually uses the loop in the models, for example in yule_walker (and
GLSAR), because in most statistical application I wouldn't expect a
large number of lags. The time series models don't use the acov
directly, but I think most of the time we just loop over the lags.
>> Now comes (at last) the question : what should be done about this
>> performance issue ?
>> - should there be a truncated np.convolve/np.correlate function, as Josef
>> suggested ?
>> - or should people in need of autocorrelation find some workarounds
>> because this usecase is not big enough to call for a change in np.convolve ?
>> I really feel this question is about *where* a change should be
>> implemented (numpy, scipy.signal, maplotlib ?) so that it makes sense while
>> not breaking 10^10 lines of numpy related code...
> Speaking for matplotlib, the acorr() (and xcorr()) functions in mpl are
> merely a convenience. The proper place for any change would not be mpl
> (although, we would certainly take advantage of any improved acorr() and
> xcorr() that are made available in numpy.
I also think that numpy or scipy would be the natural candidates for a
correlate that works fast for an intermediate number of desired lags
(but still short compared to length of data).
> Ben Root
> NumPy-Discussion mailing list
More information about the NumPy-Discussion