[Numpy-discussion] Correlation filter

josef.pktd@gmai... josef.pktd@gmai...
Fri Nov 20 10:53:51 CST 2009

On Fri, Nov 20, 2009 at 10:51 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
> I have a short 1d array x and a large 2d array y. I'd like to locate
> the places in the y array that are most like (correlated to) the x
> array.
> My first attempt, corr1, is too slow. My second attempt, corr2, is
> faster but still slow.
> I reuse the same y many times, so my third attempt will probably be to
> calculate the moving mean of y outside of the function. But before
> doing that, I was wondering if there is any existing code that could
> help me. It seems like this would be a common filter-type operation.
> Or could stacking several filter operations like mean and product do
> the trick?
> I don't need the actual correlation. I just need an output that
> preserves the ranking of the correlation. For benchmarking I am using
> x of shape (5,) and y of shape (500,500):
> x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
> y = np.random.randn(500, 500)
> def corr1(x, y):
>    d = np.nan * np.ones_like(y)
>    for i in range(y.shape[0]):
>        yi = y[i,:]
>        for j in range(x.shape[0]-1, y.shape[1]):
>            yj = yi[j+1-x.shape[0]:j+1]
>            d[i,j] = np.corrcoef(x, yj)[0,1]
>    return d
> def corr2(x, y):
>    dot = np.dot
>    sqrt = np.sqrt
>    d = np.nan * np.ones_like(y)
>    x = x - x.mean()
>    x /= x.std()
>    x = np.tile(x, (y.shape[0],1))
>    nx = x.shape[1]
>    one = np.ones(nx) / nx
>    for i in range(nx-1, y.shape[1]):
>        yi = y[:,i+1-nx:i+1]
>        yi = yi - dot(yi, one).reshape(-1,1)
>        yi /= sqrt(dot(yi*yi, one)).reshape(-1,1)
>        d[:,i] = dot(x * yi, one)
>    return d
>>> timeit corr1(x,y)
> 10 loops, best of 3: 13.3 s per loop
>>> timeit corr2(x,y)
> 10 loops, best of 3: 60.5 ms per loop
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

scipy.signal.correlate  would be fast, but it will not be easy to
subtract the correct moving mean. Subtracting a standard moving mean
would subtract different values for each observation in the window.

One possibility would be to look at a moving regression and just take
the estimated slope parameter. John D'Errico
(BSD licensed)
uses a very nice trick with pinv to get the filter to calculate a
moving slope coefficient. I read the source but didn't try it out and
didn't try to figure out how the pinv trick exactly works.
If this can be adapted to your case, then this would be the fastest I
can think of (pinv and scipy.signal.correlate would do everything in
C, or maybe one (500) loop might still be necessary)

For just getting a ranking on a linear relationship, there might be
other tricks possible, local linear regression, ... (?), but I never
tried. Also, I think with one time loop, you can do all cross section
regressions at the same time, without tricks.


More information about the NumPy-Discussion mailing list