[Numpy-discussion] Cross-covariance function
Bruce Southey
bsouthey@gmail....
Thu Jan 26 14:58:45 CST 2012
On Thu, Jan 26, 2012 at 12:45 PM, <josef.pktd@gmail.com> wrote:
> On Thu, Jan 26, 2012 at 1:25 PM, Bruce Southey <bsouthey@gmail.com> wrote:
>> On Thu, Jan 26, 2012 at 10:07 AM, Pierre Haessig
>> <pierre.haessig@crans.org> wrote:
>>> Le 26/01/2012 15:57, Bruce Southey a écrit :
>>>> Can you please provide a
>>>> couple of real examples with expected output that clearly show what
>>>> you want?
>>>>
>>> Hi Bruce,
>>>
>>> Thanks for your ticket feedback ! It's precisely because I see a big
>>> potential impact of the proposed change that I send first a ML message,
>>> second a ticket before jumping to a pull-request like a Sergio Leone's
>>> cowboy (sorry, I watched "for a few dollars more" last weekend...)
>>>
>>> Now, I realize that in the ticket writing I made the wrong trade-off
>>> between conciseness and accuracy which led to some of the errors you
>>> raised. Let me try to use your example to try to share what I have in mind.
>>>
>>>> >> X = array([-2.1, -1. , 4.3])
>>>> >> Y = array([ 3. , 1.1 , 0.12])
>>>
>>> Indeed, with today's cov behavior we have a 2x2 array:
>>>> >> cov(X,Y)
>>> array([[ 11.71 , -4.286 ],
>>> [ -4.286 , 2.14413333]])
>>>
>>> Now, when I used the word 'concatenation', I wasn't precise enough
>>> because I meant assembling X and Y in the sense of 2 vectors of
>>> observations from 2 random variables X and Y.
>>> This is achieved by concatenate(X,Y) *when properly playing with
>>> dimensions* (which I didn't mentioned) :
>>>> >> XY = np.concatenate((X[None, :], Y[None, :]))
>>> array([[-2.1 , -1. , 4.3 ],
>>> [ 3. , 1.1 , 0.12]])
>>
>> In this context, I find stacking, np.vstack((X,Y)), more appropriate
>> than concatenate.
>>
>>>
>>> In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".
>>>> >> np.cov(XY)
>>> array([[ 11.71 , -4.286 ],
>>> [ -4.286 , 2.14413333]])
>>>
>> Sure the resulting array is the same but whole process is totally different.
>>
>>
>>> (And indeed, the actual cov Python code does use concatenate() )
>> Yes, but the user does not see that. Whereas you are forcing the user
>> to do the stacking in the correct dimensions.
>>
>>
>>>
>>>
>>> Now let me come back to my assertion about this behavior *usefulness*.
>>> You'll acknowledge that np.cov(XY) is made of four blocks (here just 4
>>> simple scalars blocks).
>> No there are not '4' blocks just rows and columns.
>
> Sturla showed the 4 blocks in his first message.
>
Well, I could not follow that because the code is wrong.
X = np.array([-2.1, -1. , 4.3])
>>> cX = X - X.mean(axis=0)[np.newaxis,:]
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
cX = X - X.mean(axis=0)[np.newaxis,:]
IndexError: 0-d arrays can only use a single () or a list of newaxes
(and a single ...) as an index
whoops!
Anyhow, variance-covariance matrix is symmetric but numpy or scipy
lacks lapac's symmetrix matrix
(http://www.netlib.org/lapack/explore-html/de/d9e/group___s_y.html)
>>
>>> * diagonal blocks are just cov(X) and cov(Y) (which in this case comes
>>> to var(X) and var(Y) when setting ddof to 1)
>> Sure but variances are still covariances.
>>
>>> * off diagonal blocks are symetric and are actually the covariance
>>> estimate of X, Y observations (from
>>> http://en.wikipedia.org/wiki/Covariance)
>> Sure
>>>
>>> that is :
>>>> >> ((X-X.mean()) * (Y-Y.mean())).sum()/ (3-1)
>>> -4.2860000000000005
>>>
>>> The new proposed behaviour for cov is that cov(X,Y) would return :
>>> array(-4.2860000000000005) instead of the 2*2 matrix.
>>
>> But how you interpret an 2D array where the rows are greater than 2?
>>>>> Z=Y+X
>>>>> np.cov(np.vstack((X,Y,Z)))
>> array([[ 11.71 , -4.286 , 7.424 ],
>> [ -4.286 , 2.14413333, -2.14186667],
>> [ 7.424 , -2.14186667, 5.28213333]])
>>
>>
>>>
>>> * This would be in line with the cov(X,Y) mathematical definition, as
>>> well as with R behavior.
>> I don't care what R does because I am using Python and Python is
>> infinitely better than R is!
>>
>> But I think that is only in the 1D case.
>
> I just checked R to make sure I remember correctly
>
>> xx = matrix((1:20)^2, nrow=4)
>> xx
> [,1] [,2] [,3] [,4] [,5]
> [1,] 1 25 81 169 289
> [2,] 4 36 100 196 324
> [3,] 9 49 121 225 361
> [4,] 16 64 144 256 400
>> cov(xx, 2*xx[,1:2])
> [,1] [,2]
> [1,] 86.0000 219.3333
> [2,] 219.3333 566.0000
> [3,] 352.6667 912.6667
> [4,] 486.0000 1259.3333
> [5,] 619.3333 1606.0000
>> cov(xx)
> [,1] [,2] [,3] [,4] [,5]
> [1,] 43.0000 109.6667 176.3333 243.0000 309.6667
> [2,] 109.6667 283.0000 456.3333 629.6667 803.0000
> [3,] 176.3333 456.3333 736.3333 1016.3333 1296.3333
> [4,] 243.0000 629.6667 1016.3333 1403.0000 1789.6667
> [5,] 309.6667 803.0000 1296.3333 1789.6667 2283.0000
>
>
>>
>>> * This would save memory and computing resources. (and therefore help
>>> save the planet ;-) )
>> Nothing that you have provided shows that it will.
>
> I don't know about saving the planet, but if X and Y have the same
> number of columns, we save 3 quarters of the calculations, as Sturla
> also explained in his first message.
>
Can not figure those savings:
For a 2 by 2 output has 3 covariances (so 3/4 =0.75 is 'needed' not 25%)
a 3 by 3 output has 6 covariances
a 5 by 5 output 15 covariances
If you want to save memory and calculation then use symmetric storage
and associated methods.
Bruce
> Josef
>
>>
>>>
>>> However, I do understand that the impact for this change may be big.
>>> This indeed requires careful reviewing.
>>>
>>> Pierre
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>> Bruce
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
More information about the NumPy-Discussion
mailing list