[SciPy-User] Unit testing of Bayesian estimator
Mon Nov 9 11:18:41 CST 2009
On 11/08/2009 01:47 AM, Anne Archibald wrote:
> 2009/11/7 Bruce Southey<firstname.lastname@example.org>:
>> On Fri, Nov 6, 2009 at 6:13 PM, Anne Archibald
>> <email@example.com> wrote:
>>> I have implemented a simple Bayesian regression program (it takes
>>> events modulo one and returns a posterior probability that the data is
>>> phase-invariant plus a posterior distribution for two parameters
>>> (modulation fraction and phase) in case there is modulation).
>> I do not know your field, a little rusty on certain issues and I do
>> not consider myself a Bayesian.
>> Exactly what type of Bayesian did you use?
>> I also do not know how you implemented it especially if it is
>> empirical or Monte Carlo Markov Chains.
> It's an ultra-simple toy problem, really: I did the numerical
> integration in the absolute simplest way possible, by evaluating the
> quantity to be evaluated on a grid and averaging. See github for
> I can certainly improve on this, but I'd rather get my testing issues
> sorted out first, so that I can test the tests, as it were, on an
> implementation I'm reasonably confident is correct, before changing it
> to a mathematically more subtle one.
I do not know what you are trying to do with the code as it is not my
area. But you are using some empirical Bayesian estimator
(http://en.wikipedia.org/wiki/Empirical_Bayes_method) and thus you lose
much of the value of Bayesian as you are only dealing with modal
estimates. Really you should be obtaining the distribution of
"Probability the signal is pulsed" not just the modal estimate.
>>> rather new at this, so I'd like to construct some unit tests. Does
>>> anyone have any suggestions on how to go about this?
>> Since this is a test, the theoretical 'correctness' is irrelevant. So
>> I would guess that you should use very informative priors and data
>> with a huge amount of information. That should make the posterior have
>> an extremely narrow range so your modal estimate is very close to the
>> true value within a very small range.
> This doesn't really test whether the estimator is doing a good job,
> since if I throw mountains of information at it, even a rather badly
> wrong implementation will eventually converge to the right answer.
> (This is painful experience speaking.)
Are you testing the code or the method?
My understanding of unit tests is that they test the code not the
method. Unit tests tell me that my code is working correctly but do not
necessary tell me if the method is right always. For example, if I need
to iterate to get a solution, my test could stop after 1 or 2 rounds
before convergence because I know that rest will be correct if the first
rounds are correct.
Testing the algorithm is relatively easy because you just have to use
sensitivity analysis. Basically just use multiple data sets that vary in
the number of observations and parameters to see how well these work.
The hard part is making sense of the numbers.
Also note that you have some explicit assumptions involved like the type
of prior distribution. These tend to limit what you can do because if
these assume a uniform prior then you can not use a non-uniform data
set. Well you can but unless the data dominates the prior you will most
likely get a weird answer.
> I disagree on the issue of theoretical correctness, though. The best
> tests do exactly that: test the theoretical correctness of the routine
> in question, ideally without any reference to the implementation. To
> test the SVD, for example, you just test that the two matrices are
> both orthogonal, and you test that multiplying them together with the
> singular values between gives you your original matrix. If your
> implementation passes this test, it is computing the SVD just fine, no
> matter what it looks like inside.
I agree that code must provide theoretical correctness. But I disagree
that the code should always give the correct answer because the code is
only as good as the algorithm.
> With the frequentist signal-detection statistics I'm more familiar
> with, I can write exactly this sort of test. I talk a little more
> about it here:
> This works too well, it turns out, to apply to scipy's K-S test or my
> own Kuiper test, since their p-values are calculated rather
> approximately, so they fail.
Again, this is a failure of the algorithm not the code. Often
statistical tests rely on large sample sizes or rely on the central
limit theorem so these break down when the data is not well approximated
by the normal distribution and when the sample size is small. In the
blog, your usage of the chi-squared approximation is an example of this
- it will be inappropriate for small sample size as well as when the
true probability is very extreme (usually consider it valid between
about 0.2 and 0.8 obviously depending on sample size).
>> After that it really depends on the algorithm, the data used and what
>> you need to test. Basically you just have to say given this set of
>> inputs I get this 'result' that I consider reasonable. After all, if
>> the implementation of algorithm works then it is most likely the
>> inputs that are a problem. In statistics, problems usually enter
>> because the desired model can not be estimated from the provided data.
>> Separation of user errors from a bug in the code usually identified by
>> fitting simpler or alternative models.
> It's exactly the implementation I don't trust, here. I can scrutinize
> the implementation all I like, but I'd really like an independent
> check on my calculations, and staring at the code won't get me that.
I think that you mean is the algorithm and I do agree that looking at
the code will only tell you that you have implemented the algorithm and
will not tell you if the algorithm can be trusted.
>>> For a frequentist periodicity detector, the return value is a
>>> probability that, given the null hypothesis is true, the statistic
>>> would be this extreme. So I can construct a strong unit test by
>>> generating a collection of data sets given the null hypothesis,
>>> evaluating the statistic, and seeing whether the number that claim to
>>> be significant at a 5% level is really 5%. (In fact I can use the
>>> binomial distribution to get limits on the number of false positive.)
>>> This gives me a unit test that is completely orthogonal to my
>>> implementation, and that passes if and only if the code works. For a
>>> Bayesian hypothesis testing setup, I don't really see how to do
>>> something analogous.
>>> I can generate non-modulated data sets and confirm that my code
>>> returns a high probability that the data is not modulated, but how
>>> high should I expect the probability to be? I can generate data sets
>>> with models with known parameters and check that the best-fit
>>> parameters are close to the known parameters - but how close? Even if
>>> I do it many times, is the posterior mean unbiased? What about the
>>> posterior mode or median? I can even generate models and then data
>>> sets that are drawn from the prior distribution, but what should I
>>> expect from the code output on such a data set? I feel sure there's
>>> some test that verifies a statistical property of Bayesian
>>> estimators/hypothesis testers, but I cant quite put my finger on it.
>>> Suggestions welcome.
>> Please do not mix Frequentist or Likelihood concepts with Bayesian.
>> Also you never generate data for estimation from the prior
>> distribution, you generate it from the posterior distribution as that
>> is what your estimating.
> Um. I would be picking models from the prior distribution, not data.
> However I find the models, I have a well-defined way to generate data
> from the model.
> Why do you say it's a bad idea to mix Bayesian and frequentist
> approaches? It seems to me that as I use them to try to answer similar
> questions, it makes sense to compare them; and since I know how to
> test frequentist estimators, it's worth seeing whether I can cast
> Bayesian estimators in frequentist terms, at least for testing
This is the fundamental difference between Bayesian and Frequentist
approaches. In Bayesian, the posterior provides everything that you know
about a parameter because it is a distribution. However, the modal
parameter estimates should agree between both approaches.
>> Really in Bayesian sense all this data generation is unnecessary
>> because you have already calculated that information in computing the
>> posteriors. The posterior of a parameter is a distribution not a
>> single number so you just compare distributions. For example, you can
>> compute modal values and construct Bayesian credible intervals of the
>> parameters. These should make very strong sense to the original values
> I take this to mean that I don't need to do simulations to get
> credible intervals (while I normally would have to to get confidence
> intervals), which I agree with. But this is a different question: I'm
> talking about constructing a test by simulating the whole Bayesian
> process and seeing whether it behaves as it should. The problem is
> coming up with a sufficiently clear mathematical definition of
In Bayesian, you should have the posterior distribution of the parameter
which is far more than just the mode, mean and variance. So if the
posterior is normal, then I know what the mean and variance should be
and thus what the confidence interval should be.
>> For Bayesian work, you must address the data and the priors. In
>> particular, you need to be careful about the informativeness of the
>> prior. You can get great results just because your prior was
>> sufficiently informative but you can get great results because you
>> data was very informative.
>> Depending on how it was implemented, a improper prior can be an issue
>> because these do not guarantee a proper posterior (but often do lead
>> to proper posteriors). So if your posterior is improper then you are
>> in a very bad situation and can lead to weird results some or all of
>> the time.Some times this is can easily be fixed such as by putting
>> bounds on flat priors. Whereas proper priors give proper posteriors.
> Indeed. I think my priors are pretty safe: 50% chance it's pulsed,
> flat priors in phase and pulsed fraction. In the long run I might want
> a slightly smarter prior on pulsed fraction, but for the moment I
> think it's fine.
It is not about whether or not the prior are 'safe'. Rather it is the
relative amount of information given.
Also, from other areas, I tend to distrust anything that assumes a 50:50
split because these often lead to special results that do not always
occur. So you think great, my code is working as everything looks fine.
Then everything crashes when you deviate from that assumption because
the 50% probability 'hides' something.
An example in my area, the formula (for genetic effect) is of the form:
alpha=2*p*q*(a + d(q-p))
where alpha, a and d are parameters and p and q are frequencies that add
We could mistakenly assume that a=alpha but that is only true if d=0 or
when p=q=0.5. So we may start to get incorrect results when p is not
equal to q and d is not zero.
>> But as a final comment, it should not matter which approach you use as
>> if you do not get what you simulated then either your code is wrong or
>> you did not simulate what your code implements. (Surprising how
>> frequent the latter is.)
> This is a bit misleading. If I use a (fairly) small number of photons,
> and/or a fairly small pulsed fraction, I should be astonished if I got
> back the model parameters exactly. I know already that the data leave
> a lot of room for slop, so what I am trying to test is how well this
> Bayesian gizmo quantifies that slop.
You should not be astonished to get the prior as that is exactly what
should happen if the data contains no information.
For your simplistic case, I would implore you to generate the posterior
distribution of the probability that the signal is pulsed (yes, easier
to say than do). If you can do that then you will not only you get the
modal value but you can compute the area under that distribution that is
say less than 0.5 or whatever threshold you want to use.
More information about the SciPy-User