[Numpy-discussion] Identifying Colinear Columns of a Matrix
Mark Janikas
mjanikas@esri....
Fri Aug 26 12:41:35 CDT 2011
I wonder if my last statement is essentially the only answer... which I wanted to avoid...
Should I just loop over combinations of the columns and either compute corrcoef() for each (then check whether NaNs are present) or use the condition number to identify the singularity? I just wanted to avoid the whole k! algorithm.
MJ
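A rough sketch of the whole-matrix checks mentioned in this message, not code from the thread. It uses the `zt` array quoted further down; every other name is illustrative. The rank and the condition number flag that some dependency exists without saying which columns are involved. One assumption worth stating: `corrcoef` produces NaNs for zero-variance columns (here the column of ones), while a dependency among non-constant columns gives a finite but singular correlation matrix.

```python
import numpy as np

# Example from the thread: zt is the transposed design matrix.
zt = np.array([[1.0, 1.0, 1.0, 1.0, 1.0],
               [0.25, 0.1, 0.2, 0.25, 0.5],
               [0.75, 0.9, 0.8, 0.75, 0.5],
               [3.0, 8.0, 0.0, 5.0, 0.0]])
X = zt.T  # 5 observations, 4 columns

# Rank below the number of columns means at least one exact dependency.
rank = np.linalg.matrix_rank(X)
is_rank_deficient = rank < X.shape[1]

# A very large (or infinite) condition number signals the same thing.
cond = np.linalg.cond(X)

# NaNs appear because the column of ones has zero variance.
with np.errstate(invalid="ignore", divide="ignore"):
    corr_all = np.corrcoef(X, rowvar=False)

# Without the constant column the correlations are finite,
# even though columns 1 and 2 are still perfectly collinear.
corr_nonconst = np.corrcoef(X[:, 1:], rowvar=False)

print(rank, is_rank_deficient, cond)
```

Both checks answer "is there a problem?" in a single pass, so no loop over column combinations is needed just to detect the singularity.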
-----Original Message-----
From: numpy-discussion-bounces@scipy.org [mailto:numpy-discussion-bounces@scipy.org] On Behalf Of Mark Janikas
Sent: Friday, August 26, 2011 10:35 AM
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix
I actually use the VIF when the design matrix can be inverted.... I do it the quick and dirty way as opposed to the step regression:
1. Calculate the correlation matrix of the design matrix (without the intercept).
2. Return the diagonal of the inverse of the correlation matrix from step 1.
Again, the problem lies in the multiple column relationship... I wouldn't be able to run sub regressions at all when the columns are perfectly collinear.
MJ
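The two steps above can be sketched as follows. This is a minimal illustration, not the poster's actual code: the function name `vif_from_corr` and the demo data are hypothetical, and it assumes the columns are non-constant and not perfectly collinear, so that the correlation matrix can be inverted.

```python
import numpy as np

def vif_from_corr(X):
    """Variance inflation factors for the columns of X (no intercept column).

    Assumes the correlation matrix is invertible, i.e. no column is
    constant and no column is an exact linear combination of the others.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)   # step 1: correlation matrix
    return np.diag(np.linalg.inv(corr))   # step 2: diagonal of its inverse

# Hypothetical demo data: c is almost, but not exactly, a + b.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + 0.01 * rng.normal(size=200)
X_demo = np.column_stack([a, b, c])

vif = vif_from_corr(X_demo)
print(vif)  # large values flag columns involved in a near dependency
```

With an exact dependency the inverse in step 2 does not exist, which is the limitation described in the message.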
-----Original Message-----
From: numpy-discussion-bounces@scipy.org [mailto:numpy-discussion-bounces@scipy.org] On Behalf Of Skipper Seabold
Sent: Friday, August 26, 2011 10:28 AM
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix
On Fri, Aug 26, 2011 at 1:10 PM, Mark Janikas <mjanikas@esri.com> wrote:
> Hello All,
>
>
>
> I am trying to identify columns of a matrix that are perfectly collinear.
> It is not that difficult to identify when two columns are identical or have
> zero variance, but I do not know how to ID when the culprit is of a higher
> order. i.e. columns 1 + 2 + 3 = column 4. NUM.corrcoef(matrix.T) will
> return NaNs when the matrix is singular, and LA.cond(matrix.T) will provide
> a very large condition number.. But they do not tell me which columns are
> causing the problem. For example:
>
>
>
> zt = numpy.array([[ 1.  , 1.  , 1.  , 1.  , 1.  ],
>                   [ 0.25, 0.1 , 0.2 , 0.25, 0.5 ],
>                   [ 0.75, 0.9 , 0.8 , 0.75, 0.5 ],
>                   [ 3.  , 8.  , 0.  , 5.  , 0.  ]])
>
>
>
> How can I identify that columns 0,1,2 are the issue because: column 1 +
> column 2 = column 0?
>
>
>
> Any input would be greatly appreciated. Thanks much,
>
The way that I know to do this in a regression context for (near
perfect) multicollinearity is VIF. It's long been on my todo list for
statsmodels.
http://en.wikipedia.org/wiki/Variance_inflation_factor
Maybe there are other ways with decompositions. I'd be happy to hear about them.
Please post back if you write any code to do this.
Skipper
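One decomposition-based possibility, offered as a sketch rather than anything proposed in the thread: take the SVD of the design matrix and look at the right singular vectors that belong to (numerically) zero singular values. Those vectors span the null space, and their nonzero entries mark the columns taking part in an exact dependency. The function name `collinear_columns` and the tolerance `tol` are assumptions made for this illustration.

```python
import numpy as np

def collinear_columns(X, tol=1e-10):
    """Return (null_basis, involved) for the columns of X.

    null_basis: rows spanning the null space of X (may be empty).
    involved:   indices of columns with a nonzero weight in any null vector.
    """
    X = np.asarray(X, dtype=float)
    _, s, vt = np.linalg.svd(X, full_matrices=True)
    # Numerical rank: singular values that are not negligible vs. the largest.
    rank = int(np.sum(s > tol * s[0]))
    null_basis = vt[rank:]
    involved = np.flatnonzero(np.any(np.abs(null_basis) > tol, axis=0))
    return null_basis, involved

# Example from the thread: zt is the transposed design matrix.
zt = np.array([[1.0, 1.0, 1.0, 1.0, 1.0],
               [0.25, 0.1, 0.2, 0.25, 0.5],
               [0.75, 0.9, 0.8, 0.75, 0.5],
               [3.0, 8.0, 0.0, 5.0, 0.0]])
X = zt.T

null_basis, involved = collinear_columns(X)
print(involved)    # indices of the columns in the dependency
print(null_basis)  # weights w such that X @ w is (numerically) zero
```

When more than one independent dependency exists, the null-space basis vectors can mix them, so `involved` is the union of all affected columns rather than a list of separate groups.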
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion