[Numpy-discussion] Identifying Colinear Columns of a Matrix

josef.pktd@gmai... josef.pktd@gmai...
Fri Aug 26 13:13:43 CDT 2011


On Fri, Aug 26, 2011 at 1:41 PM, Mark Janikas <mjanikas@esri.com> wrote:
> I wonder if my last statement is essentially the only answer... which I wanted to avoid...
>
> Should I just use combinations of the columns and try and construct the corrcoef() (then ID whether NaNs are present), or use the condition number to ID the singularity?  I just wanted to avoid the whole k! algorithm.
>
> MJ
>
> -----Original Message-----
> From: numpy-discussion-bounces@scipy.org [mailto:numpy-discussion-bounces@scipy.org] On Behalf Of Mark Janikas
> Sent: Friday, August 26, 2011 10:35 AM
> To: Discussion of Numerical Python
> Subject: Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix
>
> I actually use the VIF when the design matrix can be inverted.... I do it the quick and dirty way as opposed to the step regression:
>
> 1. Calculate the correlation matrix of the design matrix (w/o the intercept).
> 2. Return the diagonal of the inverse of the correlation matrix from step 1.
>
> Again, the problem lies in the multiple column relationship... I wouldn't be able to run sub regressions at all when the columns are perfectly collinear.
>
> MJ
>
> -----Original Message-----
> From: numpy-discussion-bounces@scipy.org [mailto:numpy-discussion-bounces@scipy.org] On Behalf Of Skipper Seabold
> Sent: Friday, August 26, 2011 10:28 AM
> To: Discussion of Numerical Python
> Subject: Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix
>
> On Fri, Aug 26, 2011 at 1:10 PM, Mark Janikas <mjanikas@esri.com> wrote:
>> Hello All,
>>
>>
>>
>> I am trying to identify columns of a matrix that are perfectly collinear.
>> It is not that difficult to identify when two columns are identical or have
>> zero variance, but I do not know how to ID when the culprit is of a higher
>> order, i.e. columns 1 + 2 + 3 = column 4.  NUM.corrcoef(matrix.T) will
>> return NaNs when the matrix is singular, and LA.cond(matrix.T) will provide
>> a very large condition number, but they do not tell me which columns are
>> causing the problem.   For example:
>>
>>
>>
>> zt = numpy.array([[ 1.  ,  1.  ,  1.  ,  1.  ,  1.  ],
>>                   [ 0.25,  0.1 ,  0.2 ,  0.25,  0.5 ],
>>                   [ 0.75,  0.9 ,  0.8 ,  0.75,  0.5 ],
>>                   [ 3.  ,  8.  ,  0.  ,  5.  ,  0.  ]])
>>
>>
>>
>> How can I identify that columns 0,1,2 are the issue because: column 1 +
>> column 2 = column 0?
>>
>>
>>
>> Any input would be greatly appreciated.  Thanks much,
>>
>
> The way that I know to do this in a regression context for (near
> perfect) multicollinearity is VIF. It's long been on my todo list for
> statsmodels.
>
> http://en.wikipedia.org/wiki/Variance_inflation_factor
>
> Maybe there are other ways with decompositions. I'd be happy to hear about them.
>
> Please post back if you write any code to do this.

Partial answer in a different context. I have written a function that
only adds columns if they maintain invertibility, using brute force:
add each column sequentially and check whether the matrix is singular.
Don't add the columns that are already included as a linear combination.
(But this doesn't tell which columns are part of the collinear combination.)
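
A minimal sketch of that brute-force idea, not the original function: the
name independent_columns and the use of numpy.linalg.matrix_rank as the
singularity check are assumptions made for illustration. It keeps a column
only if it raises the numerical rank of the columns kept so far.

```python
import numpy as np


def independent_columns(x, tol=None):
    """Greedily keep the columns of x that raise the rank of the kept set.

    Hypothetical helper for illustration. Returns (kept, dropped) as lists
    of column indices; the result depends on the column order.
    """
    x = np.asarray(x, dtype=float)
    kept = []
    for j in range(x.shape[1]):
        candidate = x[:, kept + [j]]
        # matrix_rank is SVD based and uses a tolerance, so this is a
        # numerical rank test, not an exact singularity test.
        if np.linalg.matrix_rank(candidate, tol=tol) == len(kept) + 1:
            kept.append(j)
    dropped = [j for j in range(x.shape[1]) if j not in kept]
    return kept, dropped


# Example matrix from the original question (variables are the rows of zt).
zt = np.array([[1.0, 1.0, 1.0, 1.0, 1.0],
               [0.25, 0.1, 0.2, 0.25, 0.5],
               [0.75, 0.9, 0.8, 0.75, 0.5],
               [3.0, 8.0, 0.0, 5.0, 0.0]])

kept, dropped = independent_columns(zt.T)
print("kept:", kept, "dropped:", dropped)
```

As noted above, this only reports which column was redundant given the
earlier ones; it does not say which of the kept columns it depends on.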

I did this for categorical variables, so sequence was predefined.

Just finding a non-singular subspace would be easier, e.g. with PCA, SVD,
or a scikits.learn matrix decomposition (?).
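
As a rough illustration of the SVD route (a sketch under assumptions, not
tested beyond the small example from the question): the right singular
vectors that belong to (near) zero singular values span the null space, and
the columns with non-negligible weights in those vectors are the ones taking
part in a linear dependency. The function name collinear_columns and both
tolerances are made up for this example.

```python
import numpy as np


def collinear_columns(x, rtol=1e-10, entry_tol=1e-8):
    """Return (involved, null_basis) for the columns of x.

    Hypothetical helper. null_basis has one row per null-space direction;
    involved lists the columns with a non-negligible weight in any of them.
    """
    x = np.asarray(x, dtype=float)
    # full_matrices=True so that vt has one row per column of x.
    _, s, vt = np.linalg.svd(x, full_matrices=True)
    rank = int((s > rtol * s.max()).sum())
    null_basis = vt[rank:]
    involved = np.flatnonzero((np.abs(null_basis) > entry_tol).any(axis=0))
    return involved, null_basis


# Example matrix from the original question (variables are the rows of zt).
zt = np.array([[1.0, 1.0, 1.0, 1.0, 1.0],
               [0.25, 0.1, 0.2, 0.25, 0.5],
               [0.75, 0.9, 0.8, 0.75, 0.5],
               [3.0, 8.0, 0.0, 5.0, 0.0]])

involved, null_basis = collinear_columns(zt.T)
print("columns involved:", involved)
print("null-space basis:", null_basis)
```

With a single dependency the null vector gives the weights of the linear
combination directly. With several dependencies the basis vectors can mix
them, so the result only says which columns are involved in some dependency,
not how they group.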

(Factor models and Johansen's cointegration tests are also just doing
matrix decompositions that identify subspaces.)

Maybe rotation in Factor Analysis is able to identify the vectors, but
I don't have much idea about that.

Josef

>
> Skipper
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

