[SciPy-User] peer review of scientific software

Matt Newville newville@cars.uchicago....
Fri Jun 7 07:03:01 CDT 2013


On Wed, Jun 5, 2013 at 5:08 PM, Nathaniel Smith <njs@pobox.com> wrote:
> On Wed, Jun 5, 2013 at 10:36 PM, Matt Newville
> <newville@cars.uchicago.edu> wrote:
>> The paper that Alan Isaac referred to that started this conversation
>> seemed to advocate for unit testing in the sense of "don't trust the
>> codes you're using, always test them".  At first reading, this seems
>> like good advice. Since unit testing (or, at least, the phrase) is
>> relatively new for software development, it gives the appearance of
>> being new advice.  But the authors damage their case by continuing on
>> by saying not to trust analysis tools built by other scientists based
>> on the reputation and prior use of these tools.  Here, they expose the
>> weakness of favoring "unit tests" over "functional tests".  They are
>> essentially advocating throwing out decades of proven, tested work
>> (and claiming that the use of this work to date is not justified, as
>> it derives from un-due reputation of the authors of prior work) for a
>> fashionable new trend.  Science is deliberately conservative, and
>> telling scientists that unit testing is all the rage among the cool
>> programmers and they should jump on that bandwagon is not likely to
>> gain much traction.
>
> But... have you ever sat down and written tests for a piece of widely
> used academic software? (Not LAPACK, but some random large package
> that's widely used within a field but doesn't have a comprehensive
> test suite of its own.) Everyone I've heard of who's done this
> discovers bugs all over the place. Would you personally trip over them
> if you didn't test the code? Who knows, maybe not. And probably most
> of the rest -- off by one errors here and there, maybe an incorrect
> normalizing constant, etc., -- end up not mattering too much. Or maybe
> they do. How could you even tell?

Sorry for the delay in responding.

For some definition of 'widely used academic software' and some
definition of "unit testing", why yes, I have.  And I have found many
errors.  I use unit tests, and I am not saying they are bad.  I'm
saying that other testing methods are valid too.  Advocating for one
testing method to the exclusion of others is not a good idea.

I'm also defending the conservative scientist for whom the errors
that "end up not mattering much" do not, in fact, matter much, until
they become important.  That would most likely happen when the code is
applied to a new category or range of problem that previous tests did
not cover.

The non-software analogy is having a meter that is very well
calibrated and tested over some range, and applying it to a new range.
It might fail spectacularly, it might work very well, or it might work
only partially.  Applying tools to new problems is what scientific
instruments (and software) have to try to do, and they might not work
as expected.  Unit tests whose inputs cover only the expected range
are not entirely useful here.
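
To make the analogy concrete, here is a minimal sketch (a made-up
function, not from any real package) of how a test suite confined to
the expected range can pass while the code fails badly on a new one.
A naive one-pass variance is accurate for small values but loses
precision catastrophically when the data carry a large offset:

    import numpy as np

    def naive_var(x):
        # E[x^2] - E[x]^2: algebraically correct, numerically fragile
        x = np.asarray(x, dtype=float)
        return (x ** 2).mean() - x.mean() ** 2

    np.random.seed(0)
    small = np.random.normal(0.0, 1.0, 10000)  # the tested range
    large = small + 1e8                        # same variance, new range

    print(naive_var(small), np.var(small))  # agree to many digits
    print(naive_var(large), np.var(large))  # naive result is garbage,
                                            # and can even be negative

A test written over the small range passes for both functions; only
the new range exposes the flaw.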

> You should absolutely check scipy.optimize.leastsq before using it!

Really?  Are you sure that is what you mean to say?

By "you" do you mean me, personally, or do you mean everyone using
scipy.optimize.leastsq?

If you mean me, personally, it turns out I have written tests
(functional) against the NIST test suite that Josef mentioned:
  https://github.com/newville/lmfit-py/blob/master/tests/fit_NIST_leastsq.py

The results are actually not so clear.  Most tests "pass" to very high
precision, some "pass", but at lower precision than the certified
values, and some do not do very well at all.  But then, the NIST test
suite is especially grueling.  I also believe that the certified NIST
values may have actually come from MINPACK-1.  In that case, the test
shows roughly that scipy.optimize.leastsq is as good as MINPACK-1,
which is saying something, but not very much.
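
For the flavor of such a functional test without the NIST data, here
is a minimal sketch against a synthetic problem with a known answer
(the real tests linked above run the certified NIST datasets):

    import numpy as np
    from scipy.optimize import leastsq

    true_params = (2.5, 1.3, 0.5)  # amplitude, decay rate, offset

    def model(params, x):
        amp, decay, offset = params
        return amp * np.exp(-decay * x) + offset

    def residual(params, x, y):
        return model(params, x) - y

    x = np.linspace(0, 5, 101)
    y = model(true_params, x)      # noise-free, so the answer is known

    fit, ier = leastsq(residual, (1.0, 1.0, 0.0), args=(x, y))
    assert ier in (1, 2, 3, 4)     # MINPACK convergence codes
    assert np.allclose(fit, true_params, rtol=1.e-6)

The NIST problems are much nastier than this, of course; that is the
point of using them.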

But if you mean everyone, then I completely disagree.

The point of using scipy is that one can be reasonably sure it has
already been tested.  Of course there may be bugs.  What would the
tests that *everyone* writes be testing anyway?  If they just repeat
other tests, it proves little more than that they can write a test.

Should they also absolutely check numpy.sqrt?  Does numpy use the
underlying implementation from the C standard library for sqrt, or
does it have its own?  I don't know, but if you're suggesting that
everyone should test everything, I'm sure you can tell us where these
stray from the correct values by more than machine precision.
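
For what it's worth, the sqrt test that *everyone* would write might
look like the sketch below: a spot check of numpy.sqrt against the C
library's sqrt through the math module.  IEEE 754 requires a correctly
rounded square root, so the two should agree essentially exactly, and
the test proves little beyond that:

    import math
    import numpy as np

    # spot-check values across many orders of magnitude
    xs = np.concatenate([np.linspace(0.0, 10.0, 1001),
                         np.logspace(-300, 300, 601)])
    for x in xs:
        a, b = np.sqrt(x), math.sqrt(x)
        # allow one ulp of slack, just in case
        assert abs(a - b) <= np.spacing(max(a, b))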

What should one not absolutely check?

I suspect you don't mean that everyone should absolutely test
everything, but what you wrote could easily be read that way.

> You could rewrite it too if you want, I guess, and if you write a
> thorough test suite it might even work out. But it's pretty bizarre to
> me to think that someone is going to think "ah-hah, writing my own
> code + test suite will be easier than just writing a test suite!" Sure
> some people are going to find ways to procrastinate on the real
> problem (*cough*grad students*cough*) and NIH ain't just a funding
> body. But that's totally orthogonal to whether tests are good.

But if you don't trust the other person's code, why would you even
bother testing it?  And yes, I think many people would think that
writing their own code would be easier and better than writing tests
for someone else's buggy code.

My reading of the Joppa et al paper is that a principal complaint of
theirs is that people use existing software packages based on things
like "reputation of the package author(s)", and "how many times its
been used in the literature".  They advocate being very skeptical of
such software.  This ignores any testing that has already gone into
the existing package -- indeed they imply that there probably is none.

But the uses in the literature demonstrate that the library or
package can work well, at least in some cases.  This is "prior work",
and ignoring it is not good.  Ignoring the existing literature is a
very common problem in science, as many people prefer to spend a week
in the lab to save an hour in the library.  But actually *advocating*
that others not use the existing literature or existing packages is a
terrible idea.

The balance, the pH meter, and the thermocouple were each, at one
point in time, sophisticated devices.  Now, not so much.  You check
the label, check that it is not obviously wrong, and believe its
results.  Of course, these instruments have intrinsic uncertainties,
and can be just wrong in certain cases, but you are not (usually)
better off building your own.  The same holds for the C compiler, the
quick-sort algorithm, and the fast Fourier transform.  The Joppa et al
paper can easily be read to say that scientists should not trust
LAPACK, FFTPACK, or MINPACK-1.  That sounds very close to you saying
"you should absolutely check scipy.optimize.leastsq" while leaving it
unclear whether you mean "every scientist who ever uses it".  This
"trust nothing" approach could easily throw out the baby with the
bathwater.
It is certainly not how science is actually done, because science
attempts to apply previous knowledge and methods to new problems,
while maintaining a healthy skepticism that previous knowledge and
methods may be flawed.

Again, unit testing is akin to checking that your instruments are
working correctly.  Yes, this is important.  Functional testing *is*
the scientific method.
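
To fix terms with a toy example (made-up code, not from any package):
a unit test checks one small piece in isolation, like verifying that
an instrument reads zero on a known blank, while a functional test
checks that the whole procedure recovers a known answer.

    import numpy as np

    def residual(params, x, y):
        slope, intercept = params
        return (slope * x + intercept) - y

    def fit_line(x, y):
        # least-squares line fit using numpy's lstsq
        A = np.vstack([x, np.ones_like(x)]).T
        slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]
        return slope, intercept

    def test_residual_unit():
        # unit test: one component, checked in isolation
        assert residual((2.0, 1.0), np.array([3.0]),
                        np.array([7.0]))[0] == 0.0

    def test_fit_functional():
        # functional test: the full procedure recovers a known truth
        x = np.arange(10.0)
        slope, intercept = fit_line(x, 2.0 * x + 1.0)
        assert np.allclose((slope, intercept), (2.0, 1.0))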

> Honestly I'm not even sure what unit-testing "bandwagon" you're
> talking about.

Again, I'm not at all opposed to unit testing or any other testing
method, and find unit testing very useful (I was writing unit tests
yesterday, in fact, and may write more today).  But it appears to me
that some people are under the impression that a) if code has unit
tests it is bug free, and b) if code does not have unit tests, it is
full of bugs.  Both are wrong.

I take Jerome Kieffer's (always great to see synchrotron people here!)
story as a good illustration.  He didn't test before using scipy.
When he found a problem, he first assumed it was in his own code, and
only after some work found that the problem was in scipy itself.  This
is how science works.  Yes, it would have been better if the problem
hadn't existed, but now it has been fixed for later users.  If Jerome
had trusted nothing, he would have had no reason to use scipy at all,
and the bug in scipy might never have been found.  Finally, the fact
that his story of finding a bug in scipy was worth repeating suggests
that the number of bugs found per user is very low.

--Matt Newville

