[Numpy-discussion] another discussion on numpy correlate (and convolution)

Matthew Brett matthew.brett@gmail....
Fri Feb 22 10:40:46 CST 2013


Hi,

On Thu, Feb 21, 2013 at 10:00 AM, Pierre Haessig
<pierre.haessig@crans.org> wrote:
> Hi everybody,
>
> (just coming from a discussion on the performance of Matplotlib's
> (x)corr function which uses np.correlate)
>
> There have already been many discussions on how to compute
> (cross-)correlations of time series in Python (like
> http://stackoverflow.com/questions/6991471/computing-cross-correlation-function).
> The discussion is spread across the various stakeholders (just to name
> some I have in mind: scipy, statsmodels, matplotlib, ...).
>
> There are basically two implementations: time-domain and frequency-domain
> (using FFT + multiplication). My discussion is only about the time-domain
> implementation. The key use case which I feel is not well addressed today
> is computing the (cross-)correlation of two long time series at only
> a few lag points.
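>
> To make that use case concrete, here is a rough pure-NumPy sketch of what
> I mean (the function name and the lag convention are mine, just for
> illustration): evaluate r[k] = sum_n x[n + k] * y[n] directly at a handful
> of requested lags, instead of at all possible lags.
>
>     import numpy as np
>
>     def xcorr_at_lags(x, y, lags):
>         # Direct cross-correlation of two equal-length 1-D real arrays,
>         # evaluated only at the requested lags (summing over the overlap).
>         x = np.asarray(x, dtype=float)
>         y = np.asarray(y, dtype=float)
>         n = len(x)
>         out = np.empty(len(lags))
>         for i, k in enumerate(lags):
>             if k >= 0:
>                 out[i] = np.dot(x[k:], y[:n - k])
>             else:
>                 out[i] = np.dot(x[:n + k], y[-k:])
>         return out
>
>     # e.g. only 5 lags of two long series:
>     #     xcorr_at_lags(x, y, [-2, -1, 0, 1, 2])
>     # If I'm not mistaken, for equal-length real inputs each value matches
>     # np.correlate(x, y, 'full')[k + len(y) - 1] at the corresponding lag k.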
>
> For the time-domain approach, one can either write one's own
> implementation or rely on numpy.correlate. The latter relies on the
> fast C implementation _pyarray_correlate() in multiarraymodule.c
> (https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiarraymodule.c#L1177)
>
> Now, the signature of this function is meant to accept one of three
> *computation modes* ('valid', 'same', 'full'). Those modes make a lot
> of sense when using this function in the context of convolving two
> signals (say an "input array" and an "impulse response array"). In the
> context of computing the (cross-)correlation of two long time series at
> only a few lag points, those computation modes are ill-suited and
> potentially lead to huge computational and memory waste
> (for example:
> https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/axes.py#L4332).
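>
> As a rough illustration of the waste (this is my reading of what the
> matplotlib code linked above ends up doing): the 'full' correlation of
> two N-point series has 2*N - 1 lags, of which only 2*maxlags + 1 are kept.
>
>     import numpy as np
>
>     N, maxlags = 100000, 10
>     x = np.random.randn(N)
>     y = np.random.randn(N)
>
>     c_full = np.correlate(x, y, mode='full')        # 2*N - 1 = 199999 values computed
>     c_kept = c_full[N - 1 - maxlags:N + maxlags]    # only 21 of them are actually used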
>
>
>
> For some time, I was thinking the solution was to write a dedicated
> function in one of the stakeholder modules (but which one? statsmodels?),
> but now I have come to think that numpy is the best place to make the
> change, and that it would quickly benefit all stakeholders downstream.
>
> Indeed, I looked more carefully at the C _pyarray_correlate() function,
> and I've come to the conclusion that these three computation modes are
> an unnecessary restriction because the actual computation relies on a
> triple for loop
> (https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiarraymodule.c#L1177)
> whose boundaries are governed by two integers `n_left` and `n_right`
> rather than by the three modes themselves.
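>
> To make that reading concrete, here is a pure-Python sketch of how I
> understand the loop structure, with `n_left` and `n_right` exposed
> directly (the names and exact conventions are my interpretation, not an
> existing API; it assumes 1-D real inputs with len(a) >= len(v)):
>
>     import numpy as np
>
>     def correlate_n(a, v, n_left, n_right):
>         a = np.asarray(a, dtype=float)
>         v = np.asarray(v, dtype=float)
>         n1, n2 = len(a), len(v)
>         out = []
>         # left part: partial overlap, window growing from the left
>         for i in range(n_left, 0, -1):
>             out.append(np.dot(a[:n2 - i], v[i:]))
>         # central part: full overlap (this alone is mode='valid')
>         for i in range(n1 - n2 + 1):
>             out.append(np.dot(a[i:i + n2], v))
>         # right part: partial overlap, window shrinking on the right
>         for i in range(1, n_right + 1):
>             out.append(np.dot(a[n1 - n2 + i:], v[:n2 - i]))
>         return np.array(out)
>
>     # n_left = n_right = 0 reproduces mode='valid';
>     # n_left = n_right = len(v) - 1 reproduces mode='full'.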
>
> Maybe I've misunderstood the inner workings of this function, but I have
> the feeling that the ability to set these two integers directly, instead
> of just the mode, would open up the possibility of computing the
> correlation at only a few lag points.
>
> I'm fully aware that changing the signature of such a core numpy function
> is out of the question, but I have the feeling that a reasonably small
> code refactor might lift the long-standing problem of (cross-)correlation
> computation. The Python np.correlate would require two additional
> keywords (`n_left` and `n_right`) which would override the `mode`
> keyword. Only the C function would need some more care to keep good
> backward compatibility in case it is used externally.
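>
> To be explicit about the signature I have in mind (purely hypothetical,
> this is not existing NumPy API):
>
>     # today: all 2*N - 1 lags are computed, then mostly thrown away
>     c = np.correlate(x, y, mode='full')
>
>     # proposal (hypothetical keywords): take precedence over `mode` and
>     # compute only n_left/n_right extra lag points around the 'valid' part
>     c = np.correlate(x, y, n_left=10, n_right=10)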

From complete ignorance, do you think it is an option to allow an
(n_left, n_right) tuple as a value for 'mode'?
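Something like this, say (purely hypothetical, there is no such API today):

    c = np.correlate(x, y, mode=(10, 10))   # hypothetical: mode as an (n_left, n_right) tuple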

Cheers,

Matthew

