[SciPy-User] Unit testing of Bayesian estimator
Mon Nov 9 12:02:38 CST 2009
On Mon, Nov 9, 2009 at 12:18 PM, Bruce Southey <firstname.lastname@example.org> wrote:
> On 11/08/2009 01:47 AM, Anne Archibald wrote:
>> 2009/11/7 Bruce Southey<email@example.com>:
>>> On Fri, Nov 6, 2009 at 6:13 PM, Anne Archibald
>>> <firstname.lastname@example.org> wrote:
>>>> I have implemented a simple Bayesian regression program (it takes
>>>> events modulo one and returns a posterior probability that the data is
>>>> phase-invariant plus a posterior distribution for two parameters
>>>> (modulation fraction and phase) in case there is modulation).
>>> I do not know your field, a little rusty on certain issues and I do
>>> not consider myself a Bayesian.
>>> Exactly what type of Bayesian did you use?
>>> I also do not know how you implemented it especially if it is
>>> empirical or Monte Carlo Markov Chains.
>> It's an ultra-simple toy problem, really: I did the numerical
>> integration in the absolute simplest way possible, by evaluating the
>> quantity to be evaluated on a grid and averaging. See github for
>> I can certainly improve on this, but I'd rather get my testing issues
>> sorted out first, so that I can test the tests, as it were, on an
>> implementation I'm reasonably confident is correct, before changing it
>> to a mathematically more subtle one.
> I do not know what you are trying to do with the code as it is not my
> area. But you are using some empirical Bayesian estimator
> (http://en.wikipedia.org/wiki/Empirical_Bayes_method) and thus you lose
> much of the value of Bayesian as you are only dealing with modal
> estimates. Really you should be obtaining the distribution of
> "Probability the signal is pulsed" not just the modal estimate.
>>>> rather new at this, so I'd like to construct some unit tests. Does
>>>> anyone have any suggestions on how to go about this?
>>> Since this is a test, the theoretical 'correctness' is irrelevant. So
>>> I would guess that you should use very informative priors and data
>>> with a huge amount of information. That should make the posterior have
>>> an extremely narrow range so your modal estimate is very close to the
>>> true value within a very small range.
>> This doesn't really test whether the estimator is doing a good job,
>> since if I throw mountains of information at it, even a rather badly
>> wrong implementation will eventually converge to the right answer.
>> (This is painful experience speaking.)
> Are you testing the code or the method?
> My understanding of unit tests is that they test the code not the
> method. Unit tests tell me that my code is working correctly but do not
> necessary tell me if the method is right always. For example, if I need
> to iterate to get a solution, my test could stop after 1 or 2 rounds
> before convergence because I know that rest will be correct if the first
> rounds are correct.
> Testing the algorithm is relatively easy because you just have to use
> sensitivity analysis. Basically just use multiple data sets that vary in
> the number of observations and parameters to see how well these work.
> The hard part is making sense of the numbers.
> Also note that you have some explicit assumptions involved like the type
> of prior distribution. These tend to limit what you can do because if
> these assume a uniform prior then you can not use a non-uniform data
> set. Well you can but unless the data dominates the prior you will most
> likely get a weird answer.
>> I disagree on the issue of theoretical correctness, though. The best
>> tests do exactly that: test the theoretical correctness of the routine
>> in question, ideally without any reference to the implementation. To
>> test the SVD, for example, you just test that the two matrices are
>> both orthogonal, and you test that multiplying them together with the
>> singular values between gives you your original matrix. If your
>> implementation passes this test, it is computing the SVD just fine, no
>> matter what it looks like inside.
> I agree that code must provide theoretical correctness. But I disagree
> that the code should always give the correct answer because the code is
> only as good as the algorithm.
>> With the frequentist signal-detection statistics I'm more familiar
>> with, I can write exactly this sort of test. I talk a little more
>> about it here:
>> This works too well, it turns out, to apply to scipy's K-S test or my
>> own Kuiper test, since their p-values are calculated rather
>> approximately, so they fail.
> Again, this is a failure of the algorithm not the code. Often
> statistical tests rely on large sample sizes or rely on the central
> limit theorem so these break down when the data is not well approximated
> by the normal distribution and when the sample size is small. In the
> blog, your usage of the chi-squared approximation is an example of this
> - it will be inappropriate for small sample size as well as when the
> true probability is very extreme (usually consider it valid between
> about 0.2 and 0.8 obviously depending on sample size).
>>> After that it really depends on the algorithm, the data used and what
>>> you need to test. Basically you just have to say given this set of
>>> inputs I get this 'result' that I consider reasonable. After all, if
>>> the implementation of algorithm works then it is most likely the
>>> inputs that are a problem. In statistics, problems usually enter
>>> because the desired model can not be estimated from the provided data.
>>> Separation of user errors from a bug in the code usually identified by
>>> fitting simpler or alternative models.
>> It's exactly the implementation I don't trust, here. I can scrutinize
>> the implementation all I like, but I'd really like an independent
>> check on my calculations, and staring at the code won't get me that.
> I think that you mean is the algorithm and I do agree that looking at
> the code will only tell you that you have implemented the algorithm and
> will not tell you if the algorithm can be trusted.
>>>> For a frequentist periodicity detector, the return value is a
>>>> probability that, given the null hypothesis is true, the statistic
>>>> would be this extreme. So I can construct a strong unit test by
>>>> generating a collection of data sets given the null hypothesis,
>>>> evaluating the statistic, and seeing whether the number that claim to
>>>> be significant at a 5% level is really 5%. (In fact I can use the
>>>> binomial distribution to get limits on the number of false positive.)
>>>> This gives me a unit test that is completely orthogonal to my
>>>> implementation, and that passes if and only if the code works. For a
>>>> Bayesian hypothesis testing setup, I don't really see how to do
>>>> something analogous.
>>>> I can generate non-modulated data sets and confirm that my code
>>>> returns a high probability that the data is not modulated, but how
>>>> high should I expect the probability to be? I can generate data sets
>>>> with models with known parameters and check that the best-fit
>>>> parameters are close to the known parameters - but how close? Even if
>>>> I do it many times, is the posterior mean unbiased? What about the
>>>> posterior mode or median? I can even generate models and then data
>>>> sets that are drawn from the prior distribution, but what should I
>>>> expect from the code output on such a data set? I feel sure there's
>>>> some test that verifies a statistical property of Bayesian
>>>> estimators/hypothesis testers, but I cant quite put my finger on it.
>>>> Suggestions welcome.
>>> Please do not mix Frequentist or Likelihood concepts with Bayesian.
>>> Also you never generate data for estimation from the prior
>>> distribution, you generate it from the posterior distribution as that
>>> is what your estimating.
>> Um. I would be picking models from the prior distribution, not data.
>> However I find the models, I have a well-defined way to generate data
>> from the model.
>> Why do you say it's a bad idea to mix Bayesian and frequentist
>> approaches? It seems to me that as I use them to try to answer similar
>> questions, it makes sense to compare them; and since I know how to
>> test frequentist estimators, it's worth seeing whether I can cast
>> Bayesian estimators in frequentist terms, at least for testing
> This is the fundamental difference between Bayesian and Frequentist
> approaches. In Bayesian, the posterior provides everything that you know
> about a parameter because it is a distribution. However, the modal
> parameter estimates should agree between both approaches.
>>> Really in Bayesian sense all this data generation is unnecessary
>>> because you have already calculated that information in computing the
>>> posteriors. The posterior of a parameter is a distribution not a
>>> single number so you just compare distributions. For example, you can
>>> compute modal values and construct Bayesian credible intervals of the
>>> parameters. These should make very strong sense to the original values
>> I take this to mean that I don't need to do simulations to get
>> credible intervals (while I normally would have to to get confidence
>> intervals), which I agree with. But this is a different question: I'm
>> talking about constructing a test by simulating the whole Bayesian
>> process and seeing whether it behaves as it should. The problem is
>> coming up with a sufficiently clear mathematical definition of
> In Bayesian, you should have the posterior distribution of the parameter
> which is far more than just the mode, mean and variance. So if the
> posterior is normal, then I know what the mean and variance should be
> and thus what the confidence interval should be.
>>> For Bayesian work, you must address the data and the priors. In
>>> particular, you need to be careful about the informativeness of the
>>> prior. You can get great results just because your prior was
>>> sufficiently informative but you can get great results because you
>>> data was very informative.
>>> Depending on how it was implemented, a improper prior can be an issue
>>> because these do not guarantee a proper posterior (but often do lead
>>> to proper posteriors). So if your posterior is improper then you are
>>> in a very bad situation and can lead to weird results some or all of
>>> the time.Some times this is can easily be fixed such as by putting
>>> bounds on flat priors. Whereas proper priors give proper posteriors.
>> Indeed. I think my priors are pretty safe: 50% chance it's pulsed,
>> flat priors in phase and pulsed fraction. In the long run I might want
>> a slightly smarter prior on pulsed fraction, but for the moment I
>> think it's fine.
> It is not about whether or not the prior are 'safe'. Rather it is the
> relative amount of information given.
> Also, from other areas, I tend to distrust anything that assumes a 50:50
> split because these often lead to special results that do not always
> occur. So you think great, my code is working as everything looks fine.
> Then everything crashes when you deviate from that assumption because
> the 50% probability 'hides' something.
> An example in my area, the formula (for genetic effect) is of the form:
> alpha=2*p*q*(a + d(q-p))
> where alpha, a and d are parameters and p and q are frequencies that add
> to one.
> We could mistakenly assume that a=alpha but that is only true if d=0 or
> when p=q=0.5. So we may start to get incorrect results when p is not
> equal to q and d is not zero.
>>> But as a final comment, it should not matter which approach you use as
>>> if you do not get what you simulated then either your code is wrong or
>>> you did not simulate what your code implements. (Surprising how
>>> frequent the latter is.)
>> This is a bit misleading. If I use a (fairly) small number of photons,
>> and/or a fairly small pulsed fraction, I should be astonished if I got
>> back the model parameters exactly. I know already that the data leave
>> a lot of room for slop, so what I am trying to test is how well this
>> Bayesian gizmo quantifies that slop.
> You should not be astonished to get the prior as that is exactly what
> should happen if the data contains no information.
> For your simplistic case, I would implore you to generate the posterior
> distribution of the probability that the signal is pulsed (yes, easier
> to say than do). If you can do that then you will not only you get the
> modal value but you can compute the area under that distribution that is
> say less than 0.5 or whatever threshold you want to use.
With the limitation of using only a single prior at the current stage, I think
Anne has already implemented the standard Bayesian estimation including
a nice graph of the joint posterior distribution of fraction and phase.
Whether the signal is pulsed or not is just a binary variable, and the posterior
belief is just the probability that it is pulsed. (I think that's correct, since
there is no additional level where the "population" distribution of pulsing
versus non-pulsing signals is a random variable.)
I think it is very useful to look at the sampling distribution of a Bayesian
estimator or test. For an individual Bayesian everything is summarized
in the posterior distribution, but how well are a 1000 Bayesian
econometricians/statisticians doing that look at a 1000 different stars,
especially if I'm not sure they programmed their code correctly?
At the end, it's only interesting if they have a lower MSE or a higher
power in the test, otherwise I just use a different algorithm.
I usually check and test the statistical properties of an algorithm
and not just whether it is correctly implemented. And I think Anne
is doing it in a similar way, checking whether the size of a
statistical test is ok. Of course, for applications where only
incorrectly sized (biased) tests are available, it is difficult to find
a good tightness of the unit tests.
> SciPy-User mailing list
More information about the SciPy-User