[Numpy-discussion] Matrix rank default tolerance - is it too low?
Matthew Brett
matthew.brett@gmail....
Sat Jul 14 20:03:06 CDT 2012
Hi,
On Tue, Jun 26, 2012 at 7:29 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
> Hi,
>
> On Tue, Jun 26, 2012 at 5:04 PM, Charles R Harris
> <charlesr.harris@gmail.com> wrote:
>>
>>
>> On Tue, Jun 26, 2012 at 5:46 PM, Matthew Brett <matthew.brett@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> On Tue, Jun 26, 2012 at 4:39 PM, Benjamin Root <ben.root@ou.edu> wrote:
>>> >
>>> >
>>> > On Tuesday, June 26, 2012, Charles R Harris wrote:
>>> >>
>>> >>
>>> >>
>>> >> On Tue, Jun 26, 2012 at 3:42 PM, Matthew Brett
>>> >> <matthew.brett@gmail.com>
>>> >> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> On Mon, Jun 18, 2012 at 3:50 PM, Matthew Brett
>>> >> <matthew.brett@gmail.com>
>>> >> wrote:
>>> >> > Hi,
>>> >> >
>>> >> > On Sun, Jun 17, 2012 at 7:22 PM, Charles R Harris
>>> >> > <charlesr.harris@gmail.com> wrote:
>>> >> >>
>>> >> >>
>>> >> >> On Sat, Jun 16, 2012 at 2:33 PM, Matthew Brett
>>> >> >> <matthew.brett@gmail.com>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> On Sat, Jun 16, 2012 at 8:03 PM, Matthew Brett
>>> >> >>> <matthew.brett@gmail.com>
>>> >> >>> wrote:
>>> >> >>> > Hi,
>>> >> >>> >
>>> >> >>> > On Sat, Jun 16, 2012 at 10:40 AM, Nathaniel Smith <njs@pobox.com>
>>> >> >>> > wrote:
>>> >> >>> >> On Fri, Jun 15, 2012 at 4:10 AM, Charles R Harris
>>> >> >>> >> <charlesr.harris@gmail.com> wrote:
>>> >> >>> >>>
>>> >> >>> >>>
>>> >> >>> >>> On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett
>>> One potential problem is that it implies that it will always be the
>>> same as any version of matlab's tolerance. What if they change it in
>>> a future release? How likely are we to even notice?
>>>
>>>
>>> >> >>> >>> <matthew.brett@gmail.com>
>>> >> >>> >>> wrote:
>>> >> >>> >>>>
>>> >> >>> >>>> Hi,
>>> >> >>> >>>>
>>> >> >>> >>>> I noticed that numpy.linalg.matrix_rank sometimes gives full
>>> >> >>> >>>> rank
>>> >> >>> >>>> for
>>> >> >>> >>>> matrices that are numerically rank deficient:
>>> >> >>> >>>>
>>> >> >>> >>>> If I repeatedly make random matrices, then set the first
>>> >> >>> >>>> column
>>> >> >>> >>>> to be
>>> >> >>> >>>> equal to the sum of the second and third columns:
>>> >> >>> >>>>
>>> >> >>> >>>> def make_deficient():
>>> >> >>> >>>> X = np.random.normal(size=(40, 10))
>>> >> >>> >>>> deficient_X = X.copy()
>>> >> >>> >>>> deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>>> >> >>> >>>> return deficient_X
>>> >> >>> >>>>
>>> >> >>> >>>> then the current numpy.linalg.matrix_rank algorithm returns
>>> >> >>> >>>> full
>>> >> >>> >>>> rank
>>> >> >>> >>>> (10) in about 8 percent of cases (see appended script).
>>> >> >>> >>>>
>>> >> >>> >>>> I think this is a tolerance problem. The ``matrix_rank``
>>> >> >>> >>>> algorithm
>>> >> >>> >>>> does this by default:
>>> >> >>> >>>>
>>> >> >>> >>>> S = spl.svd(M, compute_uv=False)
>>> >> >>> >>>> tol = S.max() * np.finfo(S.dtype).eps
>>> >> >>> >>>> return np.sum(S > tol)
>>> >> >>> >>>>
>>> >> >>> >>>> I guess we'd we want the lowest tolerance that nearly always
>>> >> >>> >>>> or
>>> >> >>> >>>> always
>>> >> >>> >>>> identifies numerically rank deficient matrices. I suppose one
>>> >> >>> >>>> way of
>>> >> >>> >>>> looking at whether the tolerance is in the right range is to
>>> >> >>> >>>> compare
>>> >> >>> >>>> the calculated tolerance (``tol``) to the minimum singular
>>> >> >>> >>>> value
>>> >> >>> >>>> (``S.min()``) because S.min() in our case should be very small
>>> >> >>> >>>> and
>>> >> >>> >>>> indicate the rank deficiency. The mean value of tol / S.min()
>>> >> >>> >>>> for
>>> >> >>> >>>> the
>>> >> >>> >>>> current algorithm, across many iterations, is about 2.8. We
>>> >> >>> >>>> might
>>> >> >>> >>>> hope this value would be higher than 1, but not much higher,
>>> >> >>> >>>> otherwise
>>> >> >>> >>>> we might be rejecting too many columns.
>>> >> >>> >>>>
>>> >> >>> >>>> Our current algorithm for tolerance is the same as the 2-norm
>>> >> >>> >>>> of
>>> >> >>> >>>> M *
>>> >> >>> >>>> eps. We're citing Golub and Van Loan for this, but now I look
>>> >> >>> >>>> at
>>> >> >>> >>>> our
>>> >> >>> >>>> copy (p 261, last para) - they seem to be suggesting using u *
>>> >> >>> >>>> |M|
>>> >> >>> >>>> where u = (p 61, section 2.4.2) eps / 2. (see [1]). I think
>>> >> >>> >>>> the
>>> >> >>> >>>> Golub
>>> >>
>>> >>
>>> >> I'm fine with that, and agree that it is likely to lead to fewer folks
>>> >> wondering why Matlab and numpy are different. A good explanation in the
>>> >> function documentation would be useful.
>>> >>
>>> >> Chuck
>>> >>
>>> >
>>> > One potential problem is that it implies that it will always be the same
>>> > as
>>> > any version of matlab's tolerance. What if they change it in a future
>>> > release? How likely are we to even notice?
>>>
>>> I guess that matlab is unlikely to change for the same reason that we
>>> would be reluctant to change, once we've found an acceptable value.
>>>
>>> I was thinking that we would say something like:
>>>
>>> """
>>> The default tolerance is :
>>>
>>> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
>>>
>>> This corresponds to the tolerance suggested in NR page X, and to the
>>> tolerance used by MATLAB at the time of writing (June 2012; see
>>> http://www.mathworks.com/help/techdoc/ref/rank.html).
>>> """
>>>
>>> I don't know whether we would want to track changes made by matlab -
>>> maybe we could have that discussion if they do change?
>>
>>
>> I wouldn't bother tracking Matlab, but I think the alternative threshold
>> could be mentioned in the notes. Something like
>>
>> A less conservative threshold is ...
>>
>> Maybe mention that because of numerical uncertainty there will always be a
>> chance that the computed rank could be wrong, but that with the conservative
>> threshold the rank is very unlikely to be less than the computed rank.
>
> Sounds good to me. Would anyone object to a pull request with these
> changes (matlab tolerance default, description in docstring)?
Pull request here:
https://github.com/numpy/numpy/pull/357
Cheers,
Matthew
More information about the NumPy-Discussion
mailing list