[Numpy-discussion] Cross-covariance function

josef.pktd@gmai... josef.pktd@gmai...
Thu Jan 26 12:45:46 CST 2012


On Thu, Jan 26, 2012 at 1:25 PM, Bruce Southey <bsouthey@gmail.com> wrote:
> On Thu, Jan 26, 2012 at 10:07 AM, Pierre Haessig
> <pierre.haessig@crans.org> wrote:
>> On 26/01/2012 15:57, Bruce Southey wrote:
>>> Can you please provide a
>>> couple of real examples with expected output that clearly show what
>>> you want?
>>>
>> Hi Bruce,
>>
>> Thanks for your feedback on the ticket! It's precisely because I see a big
>> potential impact of the proposed change that I first sent a ML message,
>> then a ticket, before jumping to a pull request like a Sergio Leone
>> cowboy (sorry, I watched "For a Few Dollars More" last weekend...)
>>
>> Now, I realize that when writing the ticket I made the wrong trade-off
>> between conciseness and accuracy, which led to some of the errors you
>> raised. Let me use your example to share what I have in mind.
>>
>> >>> X = array([-2.1, -1. ,  4.3])
>> >>> Y = array([ 3.  ,  1.1 ,  0.12])
>>
>> Indeed, with today's cov behavior we have a 2x2 array:
>> >>> cov(X,Y)
>> array([[ 11.71      ,  -4.286     ],
>>        [ -4.286     ,   2.14413333]])
>>
>> Now, when I used the word 'concatenation', I wasn't precise enough
>> because I meant assembling X and Y in the sense of 2 vectors of
>> observations from 2 random variables X and Y.
>> This is achieved by concatenate(X,Y) *when properly playing with
>> dimensions* (which I didn't mention):
>> >>> XY = np.concatenate((X[None, :], Y[None, :]))
>> array([[-2.1 , -1.  ,  4.3 ],
>>        [ 3.  ,  1.1 ,  0.12]])
>
> In this context, I find stacking, np.vstack((X,Y)), more appropriate
> than concatenate.
>
>>
>> In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".
>> >>> np.cov(XY)
>> array([[ 11.71      ,  -4.286     ],
>>        [ -4.286     ,   2.14413333]])
>>
> Sure, the resulting array is the same, but the whole process is totally
> different.
>
>
>> (And indeed, the actual cov Python code does use concatenate() )
> Yes, but the user does not see that. Whereas you are forcing the user
> to do the stacking in the correct dimensions.
>
>
>>
>>
>> Now let me come back to my assertion about the *usefulness* of this
>> behavior. You'll acknowledge that np.cov(XY) is made of four blocks
>> (here just 4 simple scalar blocks).
> No, there are not '4' blocks, just rows and columns.

Sturla showed the 4 blocks in his first message.
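
For the 1D example quoted above, the block structure can be checked with a small sketch. This only rebuilds the 2x2 result by hand from the formulas already in the thread; the names full, cross and blocks are mine, for illustration, and nothing here is proposed API.

```python
import numpy as np

X = np.array([-2.1, -1.0, 4.3])
Y = np.array([3.0, 1.1, 0.12])

# Current behavior: X and Y are treated as two variables and the
# full 2x2 covariance matrix is returned.
full = np.cov(X, Y)

# Diagonal blocks: the variances of X and Y (ddof=1 matches np.cov's default).
var_x = X.var(ddof=1)
var_y = Y.var(ddof=1)

# Off-diagonal block: the cross-covariance of the X, Y observations.
cross = ((X - X.mean()) * (Y - Y.mean())).sum() / (X.size - 1)

# Reassemble the four scalar blocks and compare with np.cov.
blocks = np.array([[var_x, cross],
                   [cross, var_y]])
print(np.allclose(full, blocks))
```

Only the off-diagonal value is what Pierre's proposal would return for cov(X, Y).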

>
>>  * diagonal blocks are just cov(X) and cov(Y) (which in this case come
>> down to var(X) and var(Y) when setting ddof to 1)
> Sure, but variances are still covariances.
>
>>  * off-diagonal blocks are symmetric and are actually the covariance
>> estimate of the X, Y observations (from
>> http://en.wikipedia.org/wiki/Covariance)
> Sure
>>
>> that is :
>> >>> ((X-X.mean()) * (Y-Y.mean())).sum()/ (3-1)
>> -4.2860000000000005
>>
>> The new proposed behaviour for cov is that cov(X,Y) would return :
>> array(-4.2860000000000005)  instead of the 2*2 matrix.
>
> But how do you interpret a 2D array with more than 2 rows?
> >>> Z=Y+X
> >>> np.cov(np.vstack((X,Y,Z)))
> array([[ 11.71      ,  -4.286     ,   7.424     ],
>        [ -4.286     ,   2.14413333,  -2.14186667],
>        [  7.424     ,  -2.14186667,   5.28213333]])
>
>
>>
>>  * This would be in line with the cov(X,Y) mathematical definition, as
>> well as with R behavior.
> I don't care what R does because I am using Python and Python is
> infinitely better than R is!
>
> But I think that is only in the 1D case.

I just checked R to make sure I remembered correctly:

> xx = matrix((1:20)^2, nrow=4)
> xx
     [,1] [,2] [,3] [,4] [,5]
[1,]    1   25   81  169  289
[2,]    4   36  100  196  324
[3,]    9   49  121  225  361
[4,]   16   64  144  256  400
> cov(xx, 2*xx[,1:2])
         [,1]      [,2]
[1,]  86.0000  219.3333
[2,] 219.3333  566.0000
[3,] 352.6667  912.6667
[4,] 486.0000 1259.3333
[5,] 619.3333 1606.0000
> cov(xx)
         [,1]     [,2]      [,3]      [,4]      [,5]
[1,]  43.0000 109.6667  176.3333  243.0000  309.6667
[2,] 109.6667 283.0000  456.3333  629.6667  803.0000
[3,] 176.3333 456.3333  736.3333 1016.3333 1296.3333
[4,] 243.0000 629.6667 1016.3333 1403.0000 1789.6667
[5,] 309.6667 803.0000 1296.3333 1789.6667 2283.0000
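
For comparison, here is a rough NumPy sketch of what the two-argument R call does. The helper name cross_cov is hypothetical (it is not an existing NumPy function), and I assume variables are in columns and ddof=1, as in R.

```python
import numpy as np

def cross_cov(x, y, ddof=1):
    """Hypothetical helper: cross-covariance between the columns of x and y.

    Rows are observations, columns are variables (the R convention).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.shape[0]
    xc = x - x.mean(axis=0)   # center each column of x
    yc = y - y.mean(axis=0)   # center each column of y
    return xc.T @ yc / (n - ddof)

# Same data as the R session; R fills matrices column by column (order="F").
xx = (np.arange(1, 21) ** 2).reshape(4, 5, order="F")

c_xy = cross_cov(xx, 2 * xx[:, :2])   # 5x2, like cov(xx, 2*xx[,1:2])
c_xx = cross_cov(xx, xx)              # 5x5, like cov(xx)
print(c_xy.shape, c_xx.shape)
```

With both arguments equal, this gives the same array as np.cov(xx, rowvar=False).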


>
>>  * This would save memory and computing resources. (and therefore help
>> save the planet ;-) )
> Nothing that you have provided shows that it will.

I don't know about saving the planet, but if X and Y have the same
number of columns, we save 3 quarters of the calculations, as Sturla
also explained in his first message.

Josef

>
>>
>> However, I do understand that the impact for this change may be big.
>> This indeed requires careful reviewing.
>>
>> Pierre
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
> Bruce

