# [SciPy-User] Correlation coefficient of large arrays

josef.pktd@gmai... josef.pktd@gmai...
Tue Mar 16 08:38:17 CDT 2010

```
On Tue, Mar 16, 2010 at 9:21 AM, Vincent Davis <vincent@vincentdavis.net> wrote:

> Is there a way to calculate a column or row of the correlation matrix one
> at a time?  I am looking at how including an additional set of observations
> affects the correlation. For example, if I have variables a, b, c, d, ... and a
> set of observations 1-10, and the correlation is calculated for obs 1-5, I then
> add observations 6-10 and want to know the average effect of this on the
> correlation of c with (a, b, d, e, ...).
> So I only need a column or a row at a time.
> It's just not clear to me how I would do this. I guess I just need to DO
> IT.
>

I would loop by variable, not by observation.

An example is in the attached script.

Josef

>
>   *Vincent Davis
> 720-301-3003 *
> vincent@vincentdavis.net
>
>
> On Mon, Mar 15, 2010 at 11:56 PM, <josef.pktd@gmail.com> wrote:
>
>>
>>
>> On Tue, Mar 16, 2010 at 1:39 AM, Vincent Davis <vincent@vincentdavis.net> wrote:
>>
>>>  @Josef
>>>
>>> how much memory does a
>>>
>>> >>> 230000**2
>>> 52900000000L
>>>
>>> float (double) array take ?
>>>
>>>
>>>
>>> I guess I don't have a real appreciation for how large this is. I can
>>> create numpy.ones((100000,50000),dtype=np.float64), and it uses about 85% of
>>> the memory I have available. But that's a long way from 230,000 x 230,000. Of
>>> course, the array is symmetric.
>>>
>>> Is it feasible to do this by writing it to disk?
>>> The end goal is to find the difference between two correlation arrays and
>>> then calculate the mean of each column, which leaves me with a
>>> 1 x 230,000 array.
>>>
>>
>> If you don't really care about the correlation matrix itself and only need
>> the column (or row) sums, then I would just loop over it in batches and never
>> construct the full matrix.
>> E.g. take the first 1000 variables, calculate their correlation with all
>> variables (a 1000 x 230000 block, reduced to 1000 sums), and loop.
>> Not using np.corrcoef would avoid some duplicate calculations, but
>> several intermediate arrays are still necessary. So maybe using pytables or
>> similar would still be better to avoid duplicate calculations.
>>
>> Josef
>>
>>
>>
>>>
>>> On Mon, Mar 15, 2010 at 11:16 PM, <josef.pktd@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Mar 16, 2010 at 1:04 AM, Vincent Davis <
>>>> vincent@vincentdavis.net> wrote:
>>>>
>>>>> I have an array of 10 observations of 230,000 variables and want to find
>>>>> the correlation coefficient between each pair of variables.
>>>>> numpy.corrcoef(data) works, except I can only do it with about 30,000
>>>>> variables at a time: numpy.corrcoef(data[:30000]). It uses up a lot of
>>>>> memory.
>>>>> Is there a better way?
>>>>>
>>>>
>>>>
>>>> how much memory does a
>>>> >>> 230000**2
>>>> 52900000000L
>>>>
>>>> float (double) array take ?
>>>>
>>>> Josef
>>>> (I'm not going to try)
>>>>
>>>>
>>>>
>>>>>
>>>>> _______________________________________________
>>>>> SciPy-User mailing list
>>>>> SciPy-User@scipy.org
>>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>>>
>>>>>
-------------- next part --------------
'''calculating correlation coefficients sequentially for large number of
variables

A: josef-pktd
'''

import numpy as np

x = np.random.randn(4, 5)  # 4 observations (rows) of 5 variables (columns)
print(np.corrcoef(x, rowvar=False))

xzs = (x - x.mean(0)) / x.std(0)  # standardize each column (ddof=0)
nobs, nvars = xzs.shape
rowsum = np.empty(nvars)
# calculate correlation coefficient for each variable with all others
for col in range(nvars):
    # dot product of standardized columns / nobs = one row of the corr matrix
    corr = np.dot(xzs[:, col], xzs) / nobs
    rowsum[col] = corr.sum()
    print(corr)

print('\nrowsums')
print(rowsum)
print(np.corrcoef(x, rowvar=False).sum(1))
# maximum absolute difference from the np.corrcoef result
print(np.max(np.abs(rowsum - np.corrcoef(x, rowvar=False).sum(1))))
```
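
The batching idea from the thread (take about 1000 variables at a time, correlate them with all variables, keep only the sums) can be sketched as below. This is a minimal illustration and not code from the thread: the names `full_matrix_gigabytes`, `corr_column_sums`, and `batch` are hypothetical, and the data is assumed to be laid out as observations in rows and variables in columns, with no constant columns (a zero standard deviation would produce NaN).

```python
import numpy as np


def full_matrix_gigabytes(nvars, itemsize=8):
    """Size in GB (10**9 bytes) of a dense nvars x nvars array.

    itemsize=8 corresponds to float64 (double).
    """
    return nvars ** 2 * itemsize / 1e9


def corr_column_sums(x, batch=1000):
    """Column sums of the correlation matrix of x (nobs x nvars).

    Only a (batch x nvars) block of correlations exists at any time,
    so the full nvars x nvars matrix is never built.
    """
    nobs, nvars = x.shape
    xzs = (x - x.mean(0)) / x.std(0)  # standardize each column (ddof=0)
    sums = np.empty(nvars)
    for start in range(0, nvars, batch):
        stop = min(start + batch, nvars)
        # (batch x nobs) dot (nobs x nvars) -> block of correlation rows
        block = np.dot(xzs[:, start:stop].T, xzs) / nobs
        sums[start:stop] = block.sum(axis=1)
    return sums


if __name__ == "__main__":
    print(full_matrix_gigabytes(230000))  # size of the full float64 matrix
    x = np.random.randn(10, 57)
    print(corr_column_sums(x, batch=10))
```

By this arithmetic the full 230,000 x 230,000 float64 matrix would need about 423 GB, while one 1000 x 230,000 block is about 1.84 GB. Because the correlation matrix is symmetric, row sums and column sums coincide, and dividing the sums by the number of variables gives the column means.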