# [SciPy-User] [R] Correlation coefficient of large data sets

Vincent Davis vincent@vincentdavis....
Tue Mar 16 08:18:32 CDT 2010

```>
>  @ Dennis

With 35000 variables at a time, the storage is under 20 Gb; you'd have to
compute about 50 such chunks to get the entire matrix.

Is there a way to calculate a column or row of the correlation matrix one at
a time? I am looking at how including an additional set of observations
affects the correlation. For example, if I have variables a,b,c,d,... and a
set of observations 1-10, and the correlation is calculated for obs 1-5, I
then add observations 6-10 and want to know the average effect of this on
the correlation of c with (a,b,d,e,...).
So I only need a column or a row at a time. It is just not clear to me how I
would do this.

@Joshua Wiley

> cor(my.data) # calculate the correlation matrix between all variables
> (columns) of my.data

*Vincent Davis
720-301-3003 *
vincent@vincentdavis.net
my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>

On Tue, Mar 16, 2010 at 12:06 AM, Joshua Wiley <jwiley.psych@gmail.com> wrote:

> I think what you have done should be fine.  read.table() will return a
> data frame, which cor() can handle happily.  For example:
>
> my.data <- read.table("file.csv", header = TRUE, row.names = 1,
> sep=",", strip.white = TRUE) # assign your data to "my.data"
>
> cor(my.data) # calculate the correlation matrix between all variables
> (columns) of my.data
>
> What happens if you try that?
>
```
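
The question above (one column of the correlation matrix at a time, before and after adding observations) could be approached roughly as follows. This is only a sketch in NumPy, assuming rows are observations and columns are variables; the helper name `corr_column`, the random data, and the index chosen for variable "c" are all hypothetical and not from the thread.

```python
import numpy as np

def corr_column(data, j):
    """Pearson correlation of column j with every column of data.

    Assumes rows are observations and columns are variables.
    Only a vector of length n_variables is produced, never the
    full n_variables x n_variables matrix.
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)            # subtract each column's mean
    norms = np.sqrt((centered ** 2).sum(axis=0))   # root sum of squares per column
    # Dot column j against all columns, then scale to get r
    return centered.T @ centered[:, j] / (norms * norms[j])

# Hypothetical data: 10 observations of 5 variables (a, b, c, d, e)
rng = np.random.default_rng(0)
obs = rng.normal(size=(10, 5))
c = 2                                # column index standing in for variable "c"

r_first = corr_column(obs[:5], c)    # observations 1-5 only
r_all = corr_column(obs, c)          # observations 1-10

# Average absolute change in the correlation of c with the other variables
others = np.arange(obs.shape[1]) != c
mean_change = np.mean(np.abs(r_all[others] - r_first[others]))
print(mean_change)
```

The result should match the corresponding column of `np.corrcoef(obs, rowvar=False)`, at a cost of one vector per call instead of the whole matrix. A column with zero variance would give a division by zero here, so real data may need a guard for that case.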
