# [SciPy-User] Unit testing of Bayesian estimator

josef.pktd@gmai... josef.pktd@gmai...
Sun Nov 8 16:14:58 CST 2009

```On Sun, Nov 8, 2009 at 2:14 AM, Anne Archibald
<peridot.faceted@gmail.com> wrote:
> 2009/11/6  <josef.pktd@gmail.com>:
>> On Fri, Nov 6, 2009 at 7:13 PM, Anne Archibald
>> <aarchiba@physics.mcgill.ca> wrote:
>>> Hi,
>>>
>>> I have implemented a simple Bayesian regression program (it takes
>>> events modulo one and returns a posterior probability that the data is
>>> phase-invariant plus a posterior distribution for two parameters
>>> (modulation fraction and phase) in case there is modulation). I'm
>>> rather new at this, so I'd like to construct some unit tests. Does
>>>
>>> For a frequentist periodicity detector, the return value is a
>>> probability that, given the null hypothesis is true, the statistic
>>> would be this extreme. So I can construct a strong unit test by
>>> generating a collection of data sets given the null hypothesis,
>>> evaluating the statistic, and seeing whether the number that claim to
>>> be significant at a 5% level is really 5%. (In fact I can use the
>>> binomial distribution to get limits on the number of false positive.)
>>> This gives me a unit test that is completely orthogonal to my
>>> implementation, and that passes if and only if the code works. For a
>>> Bayesian hypothesis testing setup, I don't really see how to do
>>> something analogous.
>>>
>>> I can generate non-modulated data sets and confirm that my code
>>> returns a high probability that the data is not modulated, but how
>>> high should I expect the probability to be? I can generate data sets
>>> with models with known parameters and check that the best-fit
>>> parameters are close to the known parameters - but how close? Even if
>>> I do it many times, is the posterior mean unbiased? What about the
>>> posterior mode or median? I can even generate models and then data
>>> sets that are drawn from the prior distribution, but what should I
>>> expect from the code output on such a data set? I feel sure there's
>>> some test that verifies a statistical property of Bayesian
>>> estimators/hypothesis testers, but I cant quite put my finger on it.
>>
>>
>> The Bayesian experts are at pymc, maybe you can look at there tests
>> for inspiration. I don't know those, since I never looked at that
>> part.
>>
>> I never tried to test a Bayesian estimator but many properties are
>> still the same as in the non-Bayesian analysis. In my Bayesian past, I
>> essentially only used normal and t distributions, and binomial.
>>
>> One of my first tests for these things is to create a huge sample and
>> see whether the parameter estimates converge. With Bayesian analysis
>> you still have the law of large numbers, (for non-dogmatic priors)
>
> As far as getting the code roughly working, this is what I used; just
> run it generating lots of photons and see that roughly the right
> parameters come out. Unfortunately, this isn't really very sensitive
> to all the things that are supposed to make a Bayesian estimator
> better than (say) a maximum-likelihood estimator; I could have the
> probability estimation pretty badly wrong, but there are so many
> photons that anything but the right parameters are such a horrible fit
> even a somewhat wrong algorithm won't select them.
>
> Maybe you meant something different: I could also try fixing some
> model parameters and generating just a handful of photons, so I get a
> crummy estimate, but then repeating the photon generation and fit many
> times, to see if the average value of the best-fit parameters comes
> out close to the true parameters. But this is a test for unbiasedness
> of the estimator, and it's not clear that this estimator should be
> unbiased even if correct.

When I do a Monte Carlo for point estimates, I usually check bias,
variance, mean squared error,
and mean absolute and median absolute error (which is a more
robust to outliers, e.g. because for some cases the estimator produces
numerical nonsense because of non-convergence or other numerical
problems). MSE captures better cases of biased estimators that are
better in MSE sense.

I ran your test, test_bayes.py for M = 50, 500 and 1000 adding "return
in_interval_f"
and inside = test_credible_interval()

If my reading is correct inside should be 80% of M, and you are pretty close.
(M=1000 is pretty slow on my notebook)

>>> inside
39
>>> 39/50.
0.78000000000000003
>>>
>>> inside
410
>>> inside/500.
0.81999999999999995
>>>
>>> inside/1000.
0.81499999999999995

I haven't looked enough on the details yet, but I think this way you could
test more quantiles of the distribution, to see whether the posterior
distribution is roughly the same as the sampling distribution in the
MonteCarlo.

In each iteration of the Monte Carlo you get a full posterior distribution,
after a large number of iterations you have a sampling distribution,
and it should be possible to compare this distribution with the
posterior distributions. I'm still not sure how.

Isn't np.random.shuffle(r) redundant?
I didn't see anywhere were the sequence of observation in r would matter.

Why do you subtract mx in the loglikelihood function?
mx = np.amax(lpdf)
p = np.exp(lpdf - mx)/np.average(np.exp(lpdf-mx))

>
>> Do you have an example with a known posterior? Then, the posterior
>> with a large sample or the average in a Monte Carlo should still be
>> approximately the true one.
>> For symmetric distributions, the Bayesian posterior confidence
>> intervals and posterior mean should be roughly the same as the
>> frequentist estimates. With diffuse priors, in many cases the results
>> are exactly the same in Bayesian and MLE.
>> Another version I used in the past is to trace the posterior mean, as
>> the prior variance is reduced, in one extreme you should get the prior
>> back in the other extreme the MLE.
>
> My priors are flat, on (0,1) in both phase and pulsed fraction. It
> seems a bit peculiar to use anything else in phase, but I can imagine
> some sort of logarithmic prior for pulsed fraction (making 0.01-0.1
> equally likely to 0.1-1). I haven't experimented with introducing
> localized priors, but it seems like that too wouldn't be very
> sensitive to whether the Bayesian calculation is right; if the prior
> insists that the values are both 0.5, then any remotely sane algorithm
> will come up with posteriors that are also 0.5.

In the last sentence, you better hope that, if the true fraction is 0.1
than the posterior should be concentrated around 0.1 and not around
0.5. Right now you don't have an explicit prior, but once you use
one, you might want to test the effects of an informative prior.

For binomial (fraction) the natural prior is the beta distribution,
if I remember correctly. But I don't know if the marginal
posterior in this case would also be beta.

>
>>> I can even generate models and then data
>>> sets that are drawn from the prior distribution, but what should I
>>> expect from the code output on such a data set?
>>
>> If you ignore the Bayesian interpretation, then this is just a
>> standard sampling problem, you draw prior parameters and observations,
>> the rest is just finding the conditional and marginal probabilities. I
>> think the posterior odds ratio should converge in a large Monte Carlo
>> to the true one, and the significance levels should correspond to the
>> one that has been set for the test (5%).
>> (In simplest case of conjugate priors, you can just interpret the
>> prior as a previous sample and you are back to a frequentist
>> explanation.)
>
> This sounds like what I was trying for - draw a model according to the
> priors, then generate a data set according to the model. I then get
> some numbers out: the simplest is a probability that the model was
> pulsed, but I can also get a credible interval or an estimated CDF for
> the model parameters.  But I'm trying to figure out what test I should
> apply to those values to see if they make sense.
>
> For a credible interval, I suppose I could take (say) a 95% credible
> interval, then 95 times out of a hundred the model parameters I used
> to generate the data set should be in the credible interval. And I
> should be able to use the binomial distribution to put limits on how
> close to 95% I should get in M trials. This seems to work, but I'm not
> sure I understand why. The credible region is obtained from a
> probability distribution for the model parameters, but I am turning
> things around and testing the distribution of credible regions.

If you ignore the Bayesian belief interpretation, then it's just a
problem of Probability Theory, and you are just checking the
small and large sample behavior of an estimator and a test,
whether it has a Bayesian origin or not.

>
> In any case, that seems to work, so now I just need to figure out a
> similar test for the probability of being pulsed.

"probability of being pulsed"
I'm not sure what test you have in mind.
There are two interpretations:
In your current example, fraction is the fraction of observations that
are pulsed and fraction=0 is a zero probability event. So you cannot
really test fraction==0 versus fraction >0.

In the other interpretation you would have a prior probability (mass)
that your star is a pulsar with fraction >0 or a non-pulsing unit
with fraction=0.

The probabilities in both cases would be similar, but the interpretation
of the test would be different, and differ between frequentists and
Bayesians.

Overall your results look almost too "nice", with 8000 observations
you get a very narrow posterior in the plot.

Josef

>
>> The problem is that with an informative prior, you always have a
>> biased estimator in small samples and the posterior odds ratio is
>> affected by an informative prior.  And "real" Bayesians don't care
>>
>> What are your prior distributions and the likelihood function in your
>> case? Can you model degenerate and diffuse priors, so that an
>> informative prior doesn't influence you sampling results?
>> I'm trying to think of special cases where you could remove the effect
>> of the prior.
>
> I've put the code on github in case that helps make this any clearer:
> http://github.com/aarchiba/bayespf
>
> The model I'm using is either: completely uniform mod 1 (probability
> 0.5) or: the PDF is a cosine plus a constant, where the two parameters
> are the fraction of area under the cosine (as opposed to under the
> constant) and the phase offset of the cosine. The likelihood is just
> (modulo logs for range issues) the product over all observed phases x
> of PDF(fraction, phase, x). So the mode of the posterior is exactly
> the maximum-likelihood estimate (whether or not I got the math right,
> more or less).
>
>> It's a bit vague because I don't see the details, and I haven't looked
>> at this in a while.
>
> As is probably obvious, I'm pretty vague on Bayesian statistics in
> general. But I'm working on it.
>
> Anne
>
>>
>>
>>
>>
>>
>>
>>>
>>> Suggestions welcome.
>>>
>>> Thanks,
>>> Anne
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
```