[Numpy-discussion] another discussion on numpy correlate (and convolution)

josef.pktd@gmai... josef.pktd@gmai...
Fri Feb 22 09:12:59 CST 2013


On Thu, Feb 21, 2013 at 1:00 PM, Pierre Haessig
<pierre.haessig@crans.org> wrote:
> Hi everybody,
>
> (just coming from a discussion on the performance of Matplotlib's
> (x)corr function which uses np.correlate)
>
> There have been already many discussions on how to compute
> (cross-)correlations of time-series in Python (like
> http://stackoverflow.com/questions/6991471/computing-cross-correlation-function).
> The discussion is spread between the various stakeholders (just to name
> some I've in mind : scipy, statsmodel, matplotlib, ...).
>
> There are basically 2 implementations : time-domain and frequency-domain
> (using fft + multiplication). My discussion is only on time-domain
> implementation. The key usecase which I feel is not well adressed today
> is when computing the (cross)correlation of two long timeseries on only
> a few lagpoints.
>
> For time-domain, one can either write its own implementation or rely on
> numpy.correlate. The latter rely on the fast C implementation
> _pyarray_correlate() in multiarraymodule.c
> (https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiarraymodule.c#L1177)
>
> Now, the signature of this function is meant to accept one of three
> *computation modes* ('valid', 'same', 'full'). Thoses modes make a lot
> of sense when using this function in the context of convoluting two
> signals (say an "input array" and a "impulse response array"). In the
> context of computing the (cross)correlation of two long timeseries on
> only a few lagpoints, those computation modes are unadapted and
> potentially leads to huge computational and memory wastes
> (for example :
> https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/axes.py#L4332).
>
>
>
> For some time, I was thinking the solution was to write a dedicated
> function in one of the stakeholder modules (but which ? statsmodel ?)
> but now I came to think numpy is the best place to put a change and this
> would quickly benefit to all stackeholders downstream.
>
> Indeed, I looked more carefully at the C _pyarray_correlate() function,
> and I've come to the conclusion that these three computation modes are
> an unnecessary restriction because the actual computation relies on a
> triple for loop
> (https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiarraymodule.c#L1177)
> which boundaries are governed by two integers `n_left` and `n_right`
> instead of the three modes.
>
> Maybe I've misunderstood the inner-working of this function, but I've
> the feeling that the ability to set these two integers directly instead
> of just the mode would open up the possibility to compute the
> correlation on only a few lagpoints.
>
> I'm fully aware that changing the signature of such a core numpy is
> out-of-question but I've got the feeling that a reasonably small some
> code refactor might lift the longgoing problem of (cross-)correlation
> computation. The Python np.correlate would require two additional
> keywords -`n_left` and `n_right`)which would override the `mode`
> keyword. Only the C function would need some more care to keep good
> backward compatibility in case it's used externally.
>
>
>
> What do other people think ?

I think it's a good idea. I didn't look at the implementation details.

In statsmodels we also use explicit loops (x[lag:] * y[:lag]).sum(0)
at places where we expect that in most cases we only need a few
correlations (<10 or <20).
(the fft and correlate versions are mainly used where we expect or
need a large number or the full number of valid correlations)

Josef

>
> Best,
> Pierre
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>


More information about the NumPy-Discussion mailing list