[SciPy-User] Unit testing of Bayesian estimator
Sun Nov 8 01:47:04 CST 2009
2009/11/7 Bruce Southey <email@example.com>:
> On Fri, Nov 6, 2009 at 6:13 PM, Anne Archibald
> <firstname.lastname@example.org> wrote:
>> I have implemented a simple Bayesian regression program (it takes
>> events modulo one and returns a posterior probability that the data is
>> phase-invariant plus a posterior distribution for two parameters
>> (modulation fraction and phase) in case there is modulation).
> I do not know your field, a little rusty on certain issues and I do
> not consider myself a Bayesian.
> Exactly what type of Bayesian did you use?
> I also do not know how you implemented it especially if it is
> empirical or Monte Carlo Markov Chains.
It's an ultra-simple toy problem, really: I did the numerical
integration in the absolute simplest way possible, by evaluating the
quantity to be evaluated on a grid and averaging. See github for
I can certainly improve on this, but I'd rather get my testing issues
sorted out first, so that I can test the tests, as it were, on an
implementation I'm reasonably confident is correct, before changing it
to a mathematically more subtle one.
>> rather new at this, so I'd like to construct some unit tests. Does
>> anyone have any suggestions on how to go about this?
> Since this is a test, the theoretical 'correctness' is irrelevant. So
> I would guess that you should use very informative priors and data
> with a huge amount of information. That should make the posterior have
> an extremely narrow range so your modal estimate is very close to the
> true value within a very small range.
This doesn't really test whether the estimator is doing a good job,
since if I throw mountains of information at it, even a rather badly
wrong implementation will eventually converge to the right answer.
(This is painful experience speaking.)
I disagree on the issue of theoretical correctness, though. The best
tests do exactly that: test the theoretical correctness of the routine
in question, ideally without any reference to the implementation. To
test the SVD, for example, you just test that the two matrices are
both orthogonal, and you test that multiplying them together with the
singular values between gives you your original matrix. If your
implementation passes this test, it is computing the SVD just fine, no
matter what it looks like inside.
With the frequentist signal-detection statistics I'm more familiar
with, I can write exactly this sort of test. I talk a little more
about it here:
This works too well, it turns out, to apply to scipy's K-S test or my
own Kuiper test, since their p-values are calculated rather
approximately, so they fail.
> After that it really depends on the algorithm, the data used and what
> you need to test. Basically you just have to say given this set of
> inputs I get this 'result' that I consider reasonable. After all, if
> the implementation of algorithm works then it is most likely the
> inputs that are a problem. In statistics, problems usually enter
> because the desired model can not be estimated from the provided data.
> Separation of user errors from a bug in the code usually identified by
> fitting simpler or alternative models.
It's exactly the implementation I don't trust, here. I can scrutinize
the implementation all I like, but I'd really like an independent
check on my calculations, and staring at the code won't get me that.
>> For a frequentist periodicity detector, the return value is a
>> probability that, given the null hypothesis is true, the statistic
>> would be this extreme. So I can construct a strong unit test by
>> generating a collection of data sets given the null hypothesis,
>> evaluating the statistic, and seeing whether the number that claim to
>> be significant at a 5% level is really 5%. (In fact I can use the
>> binomial distribution to get limits on the number of false positive.)
>> This gives me a unit test that is completely orthogonal to my
>> implementation, and that passes if and only if the code works. For a
>> Bayesian hypothesis testing setup, I don't really see how to do
>> something analogous.
>> I can generate non-modulated data sets and confirm that my code
>> returns a high probability that the data is not modulated, but how
>> high should I expect the probability to be? I can generate data sets
>> with models with known parameters and check that the best-fit
>> parameters are close to the known parameters - but how close? Even if
>> I do it many times, is the posterior mean unbiased? What about the
>> posterior mode or median? I can even generate models and then data
>> sets that are drawn from the prior distribution, but what should I
>> expect from the code output on such a data set? I feel sure there's
>> some test that verifies a statistical property of Bayesian
>> estimators/hypothesis testers, but I cant quite put my finger on it.
>> Suggestions welcome.
> Please do not mix Frequentist or Likelihood concepts with Bayesian.
> Also you never generate data for estimation from the prior
> distribution, you generate it from the posterior distribution as that
> is what your estimating.
Um. I would be picking models from the prior distribution, not data.
However I find the models, I have a well-defined way to generate data
from the model.
Why do you say it's a bad idea to mix Bayesian and frequentist
approaches? It seems to me that as I use them to try to answer similar
questions, it makes sense to compare them; and since I know how to
test frequentist estimators, it's worth seeing whether I can cast
Bayesian estimators in frequentist terms, at least for testing
> Really in Bayesian sense all this data generation is unnecessary
> because you have already calculated that information in computing the
> posteriors. The posterior of a parameter is a distribution not a
> single number so you just compare distributions. For example, you can
> compute modal values and construct Bayesian credible intervals of the
> parameters. These should make very strong sense to the original values
I take this to mean that I don't need to do simulations to get
credible intervals (while I normally would have to to get confidence
intervals), which I agree with. But this is a different question: I'm
talking about constructing a test by simulating the whole Bayesian
process and seeing whether it behaves as it should. The problem is
coming up with a sufficiently clear mathematical definition of
> For Bayesian work, you must address the data and the priors. In
> particular, you need to be careful about the informativeness of the
> prior. You can get great results just because your prior was
> sufficiently informative but you can get great results because you
> data was very informative.
> Depending on how it was implemented, a improper prior can be an issue
> because these do not guarantee a proper posterior (but often do lead
> to proper posteriors). So if your posterior is improper then you are
> in a very bad situation and can lead to weird results some or all of
> the time.Some times this is can easily be fixed such as by putting
> bounds on flat priors. Whereas proper priors give proper posteriors.
Indeed. I think my priors are pretty safe: 50% chance it's pulsed,
flat priors in phase and pulsed fraction. In the long run I might want
a slightly smarter prior on pulsed fraction, but for the moment I
think it's fine.
> But as a final comment, it should not matter which approach you use as
> if you do not get what you simulated then either your code is wrong or
> you did not simulate what your code implements. (Surprising how
> frequent the latter is.)
This is a bit misleading. If I use a (fairly) small number of photons,
and/or a fairly small pulsed fraction, I should be astonished if I got
back the model parameters exactly. I know already that the data leave
a lot of room for slop, so what I am trying to test is how well this
Bayesian gizmo quantifies that slop.
> SciPy-User mailing list
More information about the SciPy-User