[Numpy-discussion] another discussion on numpy correlate (and convolution)
Fri Feb 22 09:12:59 CST 2013
On Thu, Feb 21, 2013 at 1:00 PM, Pierre Haessig
> Hi everybody,
> (just coming from a discussion on the performance of Matplotlib's
> (x)corr function which uses np.correlate)
> There have already been many discussions on how to compute
> (cross-)correlations of time series in Python (like
> The discussion is spread across the various stakeholders (just to name
> some I have in mind: scipy, statsmodels, matplotlib, ...).
> There are basically two implementations: time-domain and
> frequency-domain (using FFT + multiplication). My discussion concerns
> only the time-domain implementation. The key use case which I feel is
> not well addressed today is computing the (cross-)correlation of two
> long time series at only a few lag points.
> For the time domain, one can either write one's own implementation or
> rely on numpy.correlate. The latter relies on the fast C implementation
> _pyarray_correlate() in multiarraymodule.c.
> Now, the signature of this function is meant to accept one of three
> *computation modes* ('valid', 'same', 'full'). Those modes make a lot
> of sense when using this function in the context of convolving two
> signals (say an "input array" and an "impulse response array"). In the
> context of computing the (cross-)correlation of two long time series
> at only a few lag points, those computation modes are ill-suited and
> potentially lead to huge computational and memory waste
> (for example :
> For some time, I was thinking the solution was to write a dedicated
> function in one of the stakeholder modules (but which? statsmodels?),
> but I now think numpy is the best place to make this change, which
> would quickly benefit all stakeholders downstream.
> Indeed, I looked more carefully at the C _pyarray_correlate() function,
> and I've come to the conclusion that these three computation modes are
> an unnecessary restriction, because the actual computation relies on a
> triple for loop whose boundaries are governed by two integers, `n_left`
> and `n_right`, rather than by the three modes.
> Maybe I've misunderstood the inner workings of this function, but I
> have the feeling that the ability to set these two integers directly,
> instead of just the mode, would open up the possibility of computing
> the correlation at only a few lag points.
> I'm fully aware that changing the signature of such a core numpy
> function is out of the question, but I've got the feeling that a
> reasonably small code refactor might lift the long-standing problem of
> (cross-)correlation computation. The Python np.correlate would require
> two additional keywords (`n_left` and `n_right`) which would override
> the `mode` keyword. Only the C function would need some more care to
> keep good backward compatibility in case it's used externally.
> What do other people think?
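[A sketch of how the proposed keywords could behave, as a wrapper. The function `correlate_window` and its semantics are assumptions based on the proposal above, not an existing numpy API; this version wastefully slices the 'full' result, whereas the whole point of the proposal is that the C loop could honour the bounds directly.]

```python
import numpy as np

def correlate_window(a, v, mode='valid', n_left=None, n_right=None):
    """Hypothetical semantics for the proposal: if n_left/n_right are
    given they override `mode` and select lags -n_left .. +n_right."""
    if n_left is None and n_right is None:
        return np.correlate(a, v, mode)
    n_left = 0 if n_left is None else n_left
    n_right = 0 if n_right is None else n_right
    full = np.correlate(a, v, 'full')
    zero = len(v) - 1  # index of the zero-lag term in the 'full' output
    return full[zero - n_left : zero + n_right + 1]
```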
I think it's a good idea. I didn't look at the implementation details.
In statsmodels we also use explicit loops, (x[lag:] * y[:-lag]).sum(0),
in places where we expect that in most cases we only need a few
correlations (< 10 or < 20).
(The fft- and correlate-based versions are mainly used where we expect
or need a large number, or the full number, of valid correlations.)