# [SciPy-User] linear algebra: quadratic forms without linalg.inv

josef.pktd@gmai... josef.pktd@gmai...
Mon Nov 2 09:38:59 CST 2009

```On Mon, Nov 2, 2009 at 9:53 AM, Souheil Inati <souheil.inati@nyu.edu> wrote:
>
> On Nov 2, 2009, at 4:37 AM, Sturla Molden wrote:
>
>> josef.pktd@gmail.com skrev:
>>> Good, I didn't realize this when I worked on the eig and svd
>>> versions of
>>> the pca. In a similar way, I was initially puzzled that pinv can be
>>> used
>>> on the data matrix or on the covariance matrix (only the latter I
>>> have seen
>>> in books).
>>>
>>>
>>
>> I'll try to explain... If you have a matrix C, you can factorize like
>> this, with Sigma being a diagonal matrix:
>>
>>   C = U * Sigma * V'
>>
>>>>> u,s,vt = np.linalg.svd(c)
>>
>> If C is square (rank n x n), we now have the inverse
>>
>>   C**-1 = V * [S**-1] * U'
>>
>>>>> c_inv = np.mat(vt.T) * np.mat(np.eye(4)/s) * np.mat(u.T)
>>
>> And here you have the pathology diagnosis:
>>
>> A small value of s, will cause a huge value of 1/s. This is
>> "ill-conditioning" that e.g. happens with multicolinearity. You get a
>> small s, you divide by it, and rounding error skyrockets. We can
>> improve
>> the situation by editing the tiny values in Sigma to zero. That just
>> changes C by a tiny amount, but might have a dramatic stabilizing
>> effect
>> on C**-1. Now you can do your LU and not worry. It might not be clear
>> from statistics textbooks why multicolinearity is problem. But using
>> SVD, we see both the problem and the solution very clearly: A small
>> singular value might not contribute significantly to C, but could or
>> severly affect or even dominate in C**-1. We can thus get a biased but
>> numerically better approximation to C**-1 by deleting it from the
>> equation. So after editing s, we could e.g. do:
>>
>>>>> c_fixed = np.mat(u) * np.mat(np.eye(4)*s) * np.mat(vt)
>>
>> and continue with LU on c_fixed to get the quadratic form.
>>
>> Also beware that you can solve
>>
>>   C * x = b
>>
>> like this
>>
>>   x = (V * [S**-1]) * (U' * b)
>>
>> But if we are to reapeat this for several values of b, it would make
>> more sence to reconstruct C and go for the LU. Soving with LU also
>> involves two matrix multiplications:
>>
>>   L * y = b
>>   U * x = y
>>
>> but the computational demand is reduced by the triangular structure
>> of L
>> and U.
>>
>> Please don't say you'd rather preprocess data with a PCA. If C was a
>> covariance matrix, we just threw out the smallest principal components
>> out of the data. Deleting tiny singular values is in fact why PCA
>> helps!
>>
>> Also beware that
>>
>>   pca = lambda x : np.linalg.svd(x-x.mean(axis=0), full_matrices=0)
>>
>> So we can get PCA from SVD without even calculating the covariance.
>> Now
>> you have the standard deviations in Sigma, the principal components in
>> V, and the factor loadings in U. SVD is how PCA is usually computed.
>> It
>> is better than estimating Cov(X), and then apply Jacobi rotations to
>> get
>> the eigenvalues and eigenvectors of  Cov(X). One reason is that Cov(X)
>> should be estimated using a "two-pass algorithm" to cancel
>> accumulating
>> rounding error (Am Stat, 37: p.  242-247). But that equation is not
>> shown in most statistics textbooks, so most practitioners tend to not
>> know of it .
>> We can solve the common least squares problem using an SVD:
>>
>>   b = argmin { || X * b - Y  ||  **  2 }
>>
>> If we do an SVD of X, we can compute
>>
>>   b = sum( ((u[i,:] * Y )/s[i]) * vt[:,i].T )
>>
>> Unlike the other methods of fitting least squares, this one cannot
>> fail.
>> And you also see clearly what a PCA will do:
>>
>>   Skip "(u[i,:] * Y )/s[i]" for too small values of s[i]
>>
>> So you can preprocess with PCA anf fit LS in one shot.
>>
>> Ridge-regression (Tychonov regularization) is another solution to the
>> multicollinearity problem:
>>
>>    (A'A + lambda*I)*x = A'b
>>
>> But how would you choose the numerically optimal value of lambda? It
>> turns out to be a case of SVD as well. Goloub & van Loan has that on
>> page 583.
>>
>> QR with column pivoting can be seen as a case of SVD. Many use this
>> for
>> least-squares, not even knowing it is SVD. So SVD is ubiquitous in
>> data
>> modelling, even if you don't know it. :-)
>>
>> One more thing: The Cholesky factorization is always stabile, the LU
>> is
>> not. But don't be fooled: This only applies to the facotization
>> itself.
>> If you have multicolinearity, the problem is there even if you use
>> Cholesky. You get the "singular value disease" (astronomic rounding
>> error) when you solve the triangular system. A Cholesky can tell you
>> if
>> a covariance matrix is singular at your numerical precision. An SVD
>> can
>> tell you how close to singularity it is, and how to fix it. SVD
>> comes at
>> a cost, which is  slower  computation. But usually it is worth the
>> extra
>> investment in CPU cycles.
>>
>> Sturla Molden

Thanks Sturla,

I'm going to slowly work my way through this. at least I'm able now to
calculate a inverse matrix squareroot, which is useful for the quadratic
form and we will switch away from some of the remaining uses of the
matrix inverse in statsmodels.

>
>
> I agree with Sturla's comment's above 100%.    You should almost
> always use SVD to understand your linear system properties.   For
> least squares fitting QR is the modern, stable algorithm of choise.
> (see for example the matlab \ operator).   It's really a crime that we
> don't teach SVD and QR.
>
> There are two sources of error: 1. noise in the measurement and 2.
> noise in the numerics (rounding, division, etc.).   A properly
> constructed linear system solver will take care of the second type of
> error (rounding, etc.).  If your system is ill-conditioned, then you
> need to control the inversion so that the signal is maintained and the
> noise is not amplified too much.  In the overwhelming majority of
> applications, the SNR isn't better than 1000:1.  If you know your the
> relative size of your noise and signal, then you can control the SNR
> in your parameter estimates by choosing the svd truncation (noise
> amplification factor).
>
> For those of you that want an accessible reference for numerical
> stability in linear algebra, this book is a must read:
> Numerical Linear Algebra, Lloyd Trefethen
> http://www.amazon.com/Numerical-Linear-Algebra-Lloyd-Trefethen/dp/0898713617

It really depends on the application. From the applications I know,
pca is used for dimension reduction, when there are way too many
regressors to avoid overfitting. The most popular in econometrics
might be

Forecasting Using Principal Components from a Large Number of Predictors
# James H. Stock and Mark W. Watson
# Journal of the American Statistical Association, Vol. 97, No. 460
(Dec., 2002), pp. 1167-1179

A similar problem exists in chemometrics with more regressors than
NIPALS).

I don't think that compared to the big stochastic errors, numerical
precision plays much of a role. When we have large estimation errors
in small samples in statistics, we don't have to worry, for example,
about 10e-15 precision when our sampling errors are 10e-1.

Of course, there are other applications, and I'm working my way slowly
through the numerical issues.

Josef

>
> Cheers,
> Souheil
>
> ---------------------------------
>
> Souheil Inati, PhD
> Research Associate Professor
> Center for Neural Science and Department of Psychology
> Chief Physicist, NYU Center for Brain Imaging
> New York University
> 4 Washington Place, Room 809
> New York, N.Y., 10003-6621
> Office: (212) 998-3741
> Fax:     (212) 995-4011
> Email: souheil.inati@nyu.edu
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
```