[Numpy-discussion] Matrix rank default tolerance - is it too low?

Charles R Harris charlesr.harris@gmail....
Sun Jun 17 21:22:26 CDT 2012


On Sat, Jun 16, 2012 at 2:33 PM, Matthew Brett <matthew.brett@gmail.com>wrote:

> Hi,
>
> On Sat, Jun 16, 2012 at 8:03 PM, Matthew Brett <matthew.brett@gmail.com>
> wrote:
> > Hi,
> >
> > On Sat, Jun 16, 2012 at 10:40 AM, Nathaniel Smith <njs@pobox.com> wrote:
> >> On Fri, Jun 15, 2012 at 4:10 AM, Charles R Harris
> >> <charlesr.harris@gmail.com> wrote:
> >>>
> >>>
> >>> On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett <
> matthew.brett@gmail.com>
> >>> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I noticed that numpy.linalg.matrix_rank sometimes gives full rank for
> >>>> matrices that are numerically rank deficient:
> >>>>
> >>>> If I repeatedly make random matrices, then set the first column to be
> >>>> equal to the sum of the second and third columns:
> >>>>
> >>>> def make_deficient():
> >>>>    X = np.random.normal(size=(40, 10))
> >>>>    deficient_X = X.copy()
> >>>>    deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
> >>>>    return deficient_X
> >>>>
> >>>> then the current numpy.linalg.matrix_rank algorithm returns full rank
> >>>> (10) in about 8 percent of cases (see appended script).
> >>>>
> >>>> I think this is a tolerance problem.  The ``matrix_rank`` algorithm
> >>>> does this by default:
> >>>>
> >>>> S = spl.svd(M, compute_uv=False)
> >>>> tol = S.max() * np.finfo(S.dtype).eps
> >>>> return np.sum(S > tol)
> >>>>
> >>>> I guess we'd we want the lowest tolerance that nearly always or always
> >>>> identifies numerically rank deficient matrices.  I suppose one way of
> >>>> looking at whether the tolerance is in the right range is to compare
> >>>> the calculated tolerance (``tol``) to the minimum singular value
> >>>> (``S.min()``) because S.min() in our case should be very small and
> >>>> indicate the rank deficiency. The mean value of tol / S.min() for the
> >>>> current algorithm, across many iterations, is about 2.8.  We might
> >>>> hope this value would be higher than 1, but not much higher, otherwise
> >>>> we might be rejecting too many columns.
> >>>>
> >>>> Our current algorithm for tolerance is the same as the 2-norm of M *
> >>>> eps.  We're citing Golub and Van Loan for this, but now I look at our
> >>>> copy (p 261, last para) - they seem to be suggesting using u * |M|
> >>>> where u = (p 61, section 2.4.2) eps /  2. (see [1]). I think the Golub
> >>>> and Van Loan suggestion corresponds to:
> >>>>
> >>>> tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2
> >>>>
> >>>> This tolerance gives full rank for these rank-deficient matrices in
> >>>> about 39 percent of cases (tol / S.min() ratio of 1.7)
> >>>>
> >>>> We see on p 56 (section 2.3.2) that:
> >>>>
> >>>> m, n = M.shape
> >>>> 1 / sqrt(n) . |M|_{inf} <= |M|_2
> >>>>
> >>>> So we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n).
>  Setting:
> >>>>
> >>>> tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)
> >>>>
> >>>> gives about 0.5 percent error (tol / S.min() of 4.4)
> >>>>
> >>>> Using the Mathworks threshold [2]:
> >>>>
> >>>> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
> >>>>
> >>>> There are no false negatives (0 percent rank 10), but tol / S.min() is
> >>>> around 110 - so conservative, in this case.
> >>>>
> >>>> So - summary - I'm worrying our current threshold is too small,
> >>>> letting through many rank-deficient matrices without detection.  I may
> >>>> have misread Golub and Van Loan, but maybe we aren't doing what they
> >>>> suggest.  Maybe what we could use is either the MATLAB threshold or
> >>>> something like:
> >>>>
> >>>> tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
> >>>>
> >>>> - so 2 * the upper bound for the inf norm = 2 * |M|_2 * sqrt(n) . This
> >>>> gives 0 percent misses and tol / S.min() of 8.7.
> >>>>
> >>>> What do y'all think?
> >>>>
> >>>> Best,
> >>>>
> >>>> Matthew
> >>>>
> >>>> [1]
> >>>>
> http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilon
> >>>> [2] http://www.mathworks.com/help/techdoc/ref/rank.html
> >>>>
> >>>> Output from script:
> >>>>
> >>>> Percent undetected current: 9.8, tol / S.min(): 2.762
> >>>> Percent undetected inf norm: 39.1, tol / S.min(): 1.667
> >>>> Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367
> >>>> Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734
> >>>> Percent undetected MATLAB: 0.0, tol / S.min(): 110.477
> >>>>
> >>>> <script>
> >>>> import numpy as np
> >>>> import scipy.linalg as npl
> >>>>
> >>>> M = 40
> >>>> N = 10
> >>>>
> >>>> def make_deficient():
> >>>>    X = np.random.normal(size=(M, N))
> >>>>    deficient_X = X.copy()
> >>>>    if M > N: # Make a column deficient
> >>>>        deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
> >>>>    else: # Make a row deficient
> >>>>        deficient_X[0] = deficient_X[1] + deficient_X[2]
> >>>>    return deficient_X
> >>>>
> >>>> matrices = []
> >>>> ranks = []
> >>>> ranks_inf = []
> >>>> ranks_ub_inf = []
> >>>> ranks_ub2_inf = []
> >>>> ranks_mlab = []
> >>>> tols = np.zeros((1000, 6))
> >>>> for i in range(1000):
> >>>>    m = make_deficient()
> >>>>    matrices.append(m)
> >>>>    # The SVD tolerances
> >>>>    S = npl.svd(m, compute_uv=False)
> >>>>    S0 = S.max()
> >>>>    # u in Golub and Van Loan == numpy eps / 2
> >>>>    eps = np.finfo(m.dtype).eps
> >>>>    u = eps / 2
> >>>>    # Current numpy matrix_rank algorithm
> >>>>    ranks.append(np.linalg.matrix_rank(m))
> >>>>    # Which is the same as:
> >>>>    tol_s0 = S0 * eps
> >>>>    # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0))
> >>>>    # Golub and Van Loan suggestion
> >>>>    tol_inf = npl.norm(m, np.inf) * u
> >>>>    ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf))
> >>>>    # Upper bound of |X|_{inf}
> >>>>    tol_ub_inf = tol_s0 * np.sqrt(N) / 2
> >>>>    ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf))
> >>>>    # Times 2 fudge
> >>>>    tol_ub2_inf = tol_s0 * np.sqrt(N)
> >>>>    ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf))
> >>>>    # MATLAB algorithm
> >>>>    tol_mlab = tol_s0 * max(m.shape)
> >>>>    ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab))
> >>>>    # Collect tols
> >>>>    tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab,
> S.min()
> >>>>
> >>>> rel_tols = tols / tols[:, -1][:, None]
> >>>>
> >>>> fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f'
> >>>> max_rank = min(M, N)
> >>>> for name, ranks, mrt in zip(
> >>>>    ('current', 'inf norm', 'upper bound inf norm',
> >>>>     'upper bound inf norm * 2', 'MATLAB'),
> >>>>    (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab),
> >>>>    rel_tols.mean(axis=0)[:5]):
> >>>>    pcnt = np.sum(np.array(ranks) == max_rank) / 1000. * 100
> >>>>    print fmt % (name, pcnt, mrt)
> >>>> </script>
> >>>
> >>>
> >>> The polynomial fitting uses eps times the largest array dimension for
> the
> >>> relative condition number. IIRC, that choice traces back to numerical
> >>> recipes.
> >
> > Chuck - sorry - I didn't understand what you were saying, and now I
> > think you were proposing the MATLAB algorithm.   I can't find that in
> > Numerical Recipes - can you?  It would be helpful as a reference.
> >
> >> This is the same as Matlab, right?
> >
> > Yes, I believe so, i.e:
> >
> > tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
> >
> > from my original email.
> >
> >> If the Matlab condition is the most conservative, then it seems like a
> >> reasonable choice -- conservative is good so long as your false
> >> positive rate doesn't become to high, and presumably Matlab has enough
> >> user experience to know whether the false positive rate is too high.
> >
> > Are we agreeing to go for the Matlab algorithm?
>
> As extra data, current Numerical Recipes (2007, p 67) appears to prefer:
>
> tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)
>

That's interesting, as something like that with a square root was my first
choice for the least squares, but then someone mentioned the NR choice.
That was all on the mailing list way several years back when I was fixing
up the polynomial fitting routine. The NR reference is on page 517 of the
1986 edition (FORTRAN), which might be hard to come by these days ;)


> There's a discussion of algorithms in:
>
> @article{konstantinides1988statistical,
>  title={Statistical analysis of effective singular values in matrix
> rank determination},
>  author={Konstantinides, K. and Yao, K.},
>  journal={Acoustics, Speech and Signal Processing, IEEE Transactions on},
>  volume={36},
>  number={5},
>  pages={757--763},
>  year={1988},
>  publisher={IEEE}
> }
>
> Yes, restricted access:
> http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1585&tag=1
>
> Cheers,
>
>
Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/numpy-discussion/attachments/20120617/3cc4fdd5/attachment-0001.html 


More information about the NumPy-Discussion mailing list