[Numpy-discussion] Matrix rank default tolerance - is it too low?
Charles R Harris
charlesr.harris@gmail....
Tue Jun 26 16:57:32 CDT 2012
On Tue, Jun 26, 2012 at 3:42 PM, Matthew Brett <matthew.brett@gmail.com>wrote:
> Hi,
>
> On Mon, Jun 18, 2012 at 3:50 PM, Matthew Brett <matthew.brett@gmail.com>
> wrote:
> > Hi,
> >
> > On Sun, Jun 17, 2012 at 7:22 PM, Charles R Harris
> > <charlesr.harris@gmail.com> wrote:
> >>
> >>
> >> On Sat, Jun 16, 2012 at 2:33 PM, Matthew Brett <matthew.brett@gmail.com
> >
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> On Sat, Jun 16, 2012 at 8:03 PM, Matthew Brett <
> matthew.brett@gmail.com>
> >>> wrote:
> >>> > Hi,
> >>> >
> >>> > On Sat, Jun 16, 2012 at 10:40 AM, Nathaniel Smith <njs@pobox.com>
> wrote:
> >>> >> On Fri, Jun 15, 2012 at 4:10 AM, Charles R Harris
> >>> >> <charlesr.harris@gmail.com> wrote:
> >>> >>>
> >>> >>>
> >>> >>> On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett
> >>> >>> <matthew.brett@gmail.com>
> >>> >>> wrote:
> >>> >>>>
> >>> >>>> Hi,
> >>> >>>>
> >>> >>>> I noticed that numpy.linalg.matrix_rank sometimes gives full rank
> for
> >>> >>>> matrices that are numerically rank deficient:
> >>> >>>>
> >>> >>>> If I repeatedly make random matrices, then set the first column
> to be
> >>> >>>> equal to the sum of the second and third columns:
> >>> >>>>
> >>> >>>> def make_deficient():
> >>> >>>> X = np.random.normal(size=(40, 10))
> >>> >>>> deficient_X = X.copy()
> >>> >>>> deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
> >>> >>>> return deficient_X
> >>> >>>>
> >>> >>>> then the current numpy.linalg.matrix_rank algorithm returns full
> rank
> >>> >>>> (10) in about 8 percent of cases (see appended script).
> >>> >>>>
> >>> >>>> I think this is a tolerance problem. The ``matrix_rank``
> algorithm
> >>> >>>> does this by default:
> >>> >>>>
> >>> >>>> S = spl.svd(M, compute_uv=False)
> >>> >>>> tol = S.max() * np.finfo(S.dtype).eps
> >>> >>>> return np.sum(S > tol)
> >>> >>>>
> >>> >>>> I guess we'd we want the lowest tolerance that nearly always or
> >>> >>>> always
> >>> >>>> identifies numerically rank deficient matrices. I suppose one
> way of
> >>> >>>> looking at whether the tolerance is in the right range is to
> compare
> >>> >>>> the calculated tolerance (``tol``) to the minimum singular value
> >>> >>>> (``S.min()``) because S.min() in our case should be very small and
> >>> >>>> indicate the rank deficiency. The mean value of tol / S.min() for
> the
> >>> >>>> current algorithm, across many iterations, is about 2.8. We might
> >>> >>>> hope this value would be higher than 1, but not much higher,
> >>> >>>> otherwise
> >>> >>>> we might be rejecting too many columns.
> >>> >>>>
> >>> >>>> Our current algorithm for tolerance is the same as the 2-norm of
> M *
> >>> >>>> eps. We're citing Golub and Van Loan for this, but now I look at
> our
> >>> >>>> copy (p 261, last para) - they seem to be suggesting using u * |M|
> >>> >>>> where u = (p 61, section 2.4.2) eps / 2. (see [1]). I think the
> >>> >>>> Golub
> >>> >>>> and Van Loan suggestion corresponds to:
> >>> >>>>
> >>> >>>> tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2
> >>> >>>>
> >>> >>>> This tolerance gives full rank for these rank-deficient matrices
> in
> >>> >>>> about 39 percent of cases (tol / S.min() ratio of 1.7)
> >>> >>>>
> >>> >>>> We see on p 56 (section 2.3.2) that:
> >>> >>>>
> >>> >>>> m, n = M.shape
> >>> >>>> 1 / sqrt(n) . |M|_{inf} <= |M|_2
> >>> >>>>
> >>> >>>> So we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n).
> >>> >>>> Setting:
> >>> >>>>
> >>> >>>> tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)
> >>> >>>>
> >>> >>>> gives about 0.5 percent error (tol / S.min() of 4.4)
> >>> >>>>
> >>> >>>> Using the Mathworks threshold [2]:
> >>> >>>>
> >>> >>>> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
> >>> >>>>
> >>> >>>> There are no false negatives (0 percent rank 10), but tol /
> S.min()
> >>> >>>> is
> >>> >>>> around 110 - so conservative, in this case.
> >>> >>>>
> >>> >>>> So - summary - I'm worrying our current threshold is too small,
> >>> >>>> letting through many rank-deficient matrices without detection. I
> >>> >>>> may
> >>> >>>> have misread Golub and Van Loan, but maybe we aren't doing what
> they
> >>> >>>> suggest. Maybe what we could use is either the MATLAB threshold
> or
> >>> >>>> something like:
> >>> >>>>
> >>> >>>> tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
> >>> >>>>
> >>> >>>> - so 2 * the upper bound for the inf norm = 2 * |M|_2 * sqrt(n) .
> >>> >>>> This
> >>> >>>> gives 0 percent misses and tol / S.min() of 8.7.
> >>> >>>>
> >>> >>>> What do y'all think?
> >>> >>>>
> >>> >>>> Best,
> >>> >>>>
> >>> >>>> Matthew
> >>> >>>>
> >>> >>>> [1]
> >>> >>>>
> >>> >>>>
> http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilon
> >>> >>>> [2] http://www.mathworks.com/help/techdoc/ref/rank.html
> >>> >>>>
> >>> >>>> Output from script:
> >>> >>>>
> >>> >>>> Percent undetected current: 9.8, tol / S.min(): 2.762
> >>> >>>> Percent undetected inf norm: 39.1, tol / S.min(): 1.667
> >>> >>>> Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367
> >>> >>>> Percent undetected upper bound inf norm * 2: 0.0, tol / S.min():
> >>> >>>> 8.734
> >>> >>>> Percent undetected MATLAB: 0.0, tol / S.min(): 110.477
> >>> >>>>
> >>> >>>> <script>
> >>> >>>> import numpy as np
> >>> >>>> import scipy.linalg as npl
> >>> >>>>
> >>> >>>> M = 40
> >>> >>>> N = 10
> >>> >>>>
> >>> >>>> def make_deficient():
> >>> >>>> X = np.random.normal(size=(M, N))
> >>> >>>> deficient_X = X.copy()
> >>> >>>> if M > N: # Make a column deficient
> >>> >>>> deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
> >>> >>>> else: # Make a row deficient
> >>> >>>> deficient_X[0] = deficient_X[1] + deficient_X[2]
> >>> >>>> return deficient_X
> >>> >>>>
> >>> >>>> matrices = []
> >>> >>>> ranks = []
> >>> >>>> ranks_inf = []
> >>> >>>> ranks_ub_inf = []
> >>> >>>> ranks_ub2_inf = []
> >>> >>>> ranks_mlab = []
> >>> >>>> tols = np.zeros((1000, 6))
> >>> >>>> for i in range(1000):
> >>> >>>> m = make_deficient()
> >>> >>>> matrices.append(m)
> >>> >>>> # The SVD tolerances
> >>> >>>> S = npl.svd(m, compute_uv=False)
> >>> >>>> S0 = S.max()
> >>> >>>> # u in Golub and Van Loan == numpy eps / 2
> >>> >>>> eps = np.finfo(m.dtype).eps
> >>> >>>> u = eps / 2
> >>> >>>> # Current numpy matrix_rank algorithm
> >>> >>>> ranks.append(np.linalg.matrix_rank(m))
> >>> >>>> # Which is the same as:
> >>> >>>> tol_s0 = S0 * eps
> >>> >>>> # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0))
> >>> >>>> # Golub and Van Loan suggestion
> >>> >>>> tol_inf = npl.norm(m, np.inf) * u
> >>> >>>> ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf))
> >>> >>>> # Upper bound of |X|_{inf}
> >>> >>>> tol_ub_inf = tol_s0 * np.sqrt(N) / 2
> >>> >>>> ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf))
> >>> >>>> # Times 2 fudge
> >>> >>>> tol_ub2_inf = tol_s0 * np.sqrt(N)
> >>> >>>> ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf))
> >>> >>>> # MATLAB algorithm
> >>> >>>> tol_mlab = tol_s0 * max(m.shape)
> >>> >>>> ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab))
> >>> >>>> # Collect tols
> >>> >>>> tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab,
> >>> >>>> S.min()
> >>> >>>>
> >>> >>>> rel_tols = tols / tols[:, -1][:, None]
> >>> >>>>
> >>> >>>> fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f'
> >>> >>>> max_rank = min(M, N)
> >>> >>>> for name, ranks, mrt in zip(
> >>> >>>> ('current', 'inf norm', 'upper bound inf norm',
> >>> >>>> 'upper bound inf norm * 2', 'MATLAB'),
> >>> >>>> (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab),
> >>> >>>> rel_tols.mean(axis=0)[:5]):
> >>> >>>> pcnt = np.sum(np.array(ranks) == max_rank) / 1000. * 100
> >>> >>>> print fmt % (name, pcnt, mrt)
> >>> >>>> </script>
> >>> >>>
> >>> >>>
> >>> >>> The polynomial fitting uses eps times the largest array dimension
> for
> >>> >>> the
> >>> >>> relative condition number. IIRC, that choice traces back to
> numerical
> >>> >>> recipes.
> >>> >
> >>> > Chuck - sorry - I didn't understand what you were saying, and now I
> >>> > think you were proposing the MATLAB algorithm. I can't find that in
> >>> > Numerical Recipes - can you? It would be helpful as a reference.
> >>> >
> >>> >> This is the same as Matlab, right?
> >>> >
> >>> > Yes, I believe so, i.e:
> >>> >
> >>> > tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
> >>> >
> >>> > from my original email.
> >>> >
> >>> >> If the Matlab condition is the most conservative, then it seems
> like a
> >>> >> reasonable choice -- conservative is good so long as your false
> >>> >> positive rate doesn't become to high, and presumably Matlab has
> enough
> >>> >> user experience to know whether the false positive rate is too high.
> >>> >
> >>> > Are we agreeing to go for the Matlab algorithm?
> >>>
> >>> As extra data, current Numerical Recipes (2007, p 67) appears to
> prefer:
> >>>
> >>> tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)
> >>
> >>
> >> That's interesting, as something like that with a square root was my
> first
> >> choice for the least squares, but then someone mentioned the NR choice.
> That
> >> was all on the mailing list way several years back when I was fixing up
> the
> >> polynomial fitting routine. The NR reference is on page 517 of the 1986
> >> edition (FORTRAN), which might be hard to come by these days ;)
> >
> > Thanks for tracking that down, it's very helpful.
> >
> > For those of you not near a huge University library or your own
> > private copy, p517 says:
> >
> > "A plausible answer to the question "how small is small", is to edit
> > in this fashion all singular values whose ratio to the largest
> > singular value is less then N times the machine precision \epsilon.
> > (You might argue for root N, or a constant, instead of N as the
> > multiple; that starts getting into hardware-dependent questions).
> >
> > Earlier (p510) we see the (General Linear Least Squares) problem being
> > set up as A = (N x M) where N >= M.
> >
> > The 2007 edition replaces the "(You might argue... )" text with: (p 795)
> >
> > "(This is a more conservative recommendation than the default in
> > section 2.6 which scales as N^{1/2})"
> >
> > and this in turn refers to the threshold:
> >
> > tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)
> >
> > (p67) - which is justified as being (p71) " ... a default value based
> > on expected roundoff error".
> >
> > I could not at first glance see any other justification for this
> > threshold in the text.
> >
> > So, how about something like:
> >
> > def matrix_rank(M, tol='maxdim'):
> > ...
> >
> > tol: {'maxdim', 'nr-roundoff'} or float
> > If str, gives threshold strategy for tolerance relative to
> > the maximum singular value, explained below. If float gives absolute
> > tolerance below which singular values assumed zero. For the threshold
> > strategies, we will call the maximum singular value``S.max()`` and
> > the floating point epsilon for the working precision data type
> > ``eps``. Default strategy is 'maxdim' corresponding to ``tol =
> > S.max() * eps * max(M.shape)``. This is the MATLAB default; see also
> > Numerical Recipes 2007. Other options are 'nr-roundoff' (also from
> > Numerical Recipes 2007) corresponding to ``tol = S.max() * eps / 2 *
> > np.sqrt(M.shape[0] + M.shape[1] + 1)``.
> >
> > ?
>
> Thinking about this more, I'm tempted to just go for the matlab / old
> NR solution, that is:
>
> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
>
> and not offer any other options. My reasoning is that the matrix_rank
> code is basically trivial; it is a convenience function. If someone
> wants a better matrix rank it's reasonable to copy these few lines
> into a new function. So providing multiple options for tolerance
> seems like overkill.
>
> Then the question is whether to use NR-2007:
>
> tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)
>
> or the matlab default above. I'm leaning to the matlab version
> because we can be more or less sure that it does not miss actual
> numerical rank deficiency and it might be what people are expecting,
> if they shift from matlab or compare to matlab. The NR-2007 might
> well be closer to an accurate threshold for numerical rank deficiency,
> but I haven't tested it with all variants and all platforms, and the
> reduced false positives seems a minor gain compared to the slight
> increase in risk of false negatives and difference from matlab.
>
> What do y'all think,
>
>
I'm fine with that, and agree that it is likely to lead to fewer folks
wondering why Matlab and numpy are different. A good explanation in the
function documentation would be useful.
Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/numpy-discussion/attachments/20120626/d2668130/attachment-0001.html
More information about the NumPy-Discussion
mailing list