[SciPy-User] Unit testing of Bayesian estimator

josef.pktd@gmai...
Sun Nov 8 20:35:18 CST 2009


On Sun, Nov 8, 2009 at 5:51 PM, Anne Archibald
<peridot.faceted@gmail.com> wrote:
> 2009/11/8  <josef.pktd@gmail.com>:
>
>> When I do a Monte Carlo for point estimates, I usually check bias,
>> variance, mean squared error,
>> and mean absolute and median absolute error (which are more
>> robust to outliers, e.g. when in some cases the estimator produces
>> numerical nonsense because of non-convergence or other numerical
>> problems). MSE better captures cases of biased estimators that are
>> better in the MSE sense.
>
> I can certainly compute all these quantities from a collection of
> Monte Carlo runs, but I don't have any idea what values would indicate
> correctness, apart from "not too big".

I consider them mainly as an absolute standard to see how well
an estimator works (or what the size and power of a test are), or
as a way to compare estimators against each other, which is the common
case when publishing Monte Carlo studies for new estimators.
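
For reference, a rough sketch of these summary statistics (the names
mc_summary, estimates and true are made up; estimates would be the array
of point estimates across the Monte Carlo runs, true the parameter used
to generate the data):

    import numpy as np

    def mc_summary(estimates, true):
        # deviations of the Monte Carlo point estimates from the true value
        err = np.asarray(estimates) - true
        return {'bias': err.mean(),
                'variance': err.var(),
                'mse': (err ** 2).mean(),
                'mae': np.abs(err).mean(),          # mean absolute error
                'medae': np.median(np.abs(err))}    # median absolute error, robust to outliers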


>
>> I ran your test, test_bayes.py, for M = 50, 500 and 1000, adding "return
>> in_interval_f" and inside = test_credible_interval()
>>
>> If my reading is correct, inside should be 80% of M, and you are pretty close.
>> (M=1000 is pretty slow on my notebook)
>
> Yeah, that's the problem with using the world's simplest numerical
> integration scheme.
>
>>>>> inside
>> 39
>>>>> 39/50.
>> 0.78000000000000003
>>>>>
>>>>> inside
>> 410
>>>>> inside/500.
>> 0.81999999999999995
>>>>>
>>>>> inside/1000.
>> 0.81499999999999995
>>
>> I haven't looked closely enough at the details yet, but I think this way you
>> could test more quantiles of the distribution, to see whether the posterior
>> distribution is roughly the same as the sampling distribution in the
>> Monte Carlo.
>
> I could test more quantiles, but I'm very distrustful of testing more
> than one quantile per randomly-generated sample: they should be
> covariant (if the 90% mark is too high, the 95% mark will almost
> certainly be too high as well) and I don't know how to take that into
> account. And running the test is currently so slow I'm inclined to
> spend my CPU time on a stricter test of a single quantile. Though
> unfortunately to increase the strictness I also need to improve the
> sampling in phase and fraction.

Adding additional quantiles might be relatively cheap; the main cost is the
call to searchsorted. One or two quantiles could be consistent
with many different distributions, e.g. ones with fatter tails, so I usually
check more points.
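
A rough sketch of what I mean, assuming you record in each run the posterior
CDF evaluated at the true parameter (cdf_at_truth and coverage_at_levels are
made-up names): if the posterior is calibrated, those values should be roughly
uniform on (0, 1), so the empirical coverage of a central credible band should
match each nominal level.

    import numpy as np

    def coverage_at_levels(cdf_at_truth, levels=(0.5, 0.8, 0.9, 0.95)):
        u = np.sort(np.asarray(cdf_at_truth))
        out = {}
        for level in levels:
            lo = (1.0 - level) / 2.0
            hi = 1.0 - lo
            # number of runs whose true parameter falls inside the central band
            inside = np.searchsorted(u, hi) - np.searchsorted(u, lo)
            out[level] = inside / float(len(u))
        return out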

>
>> In each iteration of the Monte Carlo you get a full posterior distribution;
>> after a large number of iterations you have a sampling distribution,
>> and it should be possible to compare this distribution with the
>> posterior distributions. I'm still not sure how.
>
> I don't understand what you mean here. I do get a full posterior
> distribution out of every simulation. But how would I combine these
> different distributions, and what would the combined distribution
> mean?

I'm still trying to think through how this can be done. Checking more quantiles,
as discussed above, might do it to some extent.
(I also wonder whether it might be useful to fix the observations
during the Monte Carlo and only vary the sampling of the parameters?)

>
>> two questions to your algorithm
>>
>> Isn't np.random.shuffle(r) redundant?
>> I didn't see anywhere where the sequence of observations in r would matter.
>
> It is technically redundant. But since the point of all this is that I
> don't trust my code to be right, I want to make sure there's no way it
> can "cheat" by taking advantage of the order. And in any case, the
> slow part is my far-too-simple numerical integration scheme. I'm
> pretty sure the phase integration, at least, could be done
> analytically.
>
>> Why do you subtract mx in the loglikelihood function?
>>    mx = np.amax(lpdf)
>>    p = np.exp(lpdf - mx)/np.average(np.exp(lpdf-mx))
>
> This is to avoid overflows. I could just use logsumexp/logaddexp, but
> that's not yet in numpy on any of the machines I regularly use. It has
> no effect on the value, since it's subtracted from top and bottom
> both, but it ensures that the largest argument to the exponential is
> exactly zero.
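
Just to make the cancellation explicit, the same trick as a small function
(normalized_weights is a made-up name, not the actual code):

    import numpy as np

    def normalized_weights(lpdf):
        # shift by the maximum log-density so the largest argument to exp is 0;
        # the shift cancels between numerator and denominator, so the result
        # is unchanged, but exp() can no longer overflow
        mx = np.amax(lpdf)
        w = np.exp(lpdf - mx)
        return w / np.average(w)
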
>
>>>>> I can even generate models and then data
>>>>> sets that are drawn from the prior distribution, but what should I
>>>>> expect from the code output on such a data set?
>>>>
>>>> If you ignore the Bayesian interpretation, then this is just a
>>>> standard sampling problem, you draw prior parameters and observations,
>>>> the rest is just finding the conditional and marginal probabilities. I
>>>> think the posterior odds ratio should converge in a large Monte Carlo
>>>> to the true one, and the significance levels should correspond to the
>>>> one that has been set for the test (5%).
>>>> (In simplest case of conjugate priors, you can just interpret the
>>>> prior as a previous sample and you are back to a frequentist
>>>> explanation.)
>>>
>>> This sounds like what I was trying for - draw a model according to the
>>> priors, then generate a data set according to the model. I then get
>>> some numbers out: the simplest is a probability that the model was
>>> pulsed, but I can also get a credible interval or an estimated CDF for
>>> the model parameters.  But I'm trying to figure out what test I should
>>> apply to those values to see if they make sense.
>>>
>>> For a credible interval, I suppose I could take (say) a 95% credible
>>> interval, then 95 times out of a hundred the model parameters I used
>>> to generate the data set should be in the credible interval. And I
>>> should be able to use the binomial distribution to put limits on how
>>> close to 95% I should get in M trials. This seems to work, but I'm not
>>> sure I understand why. The credible region is obtained from a
>>> probability distribution for the model parameters, but I am turning
>>> things around and testing the distribution of credible regions.
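
A sketch of those binomial limits on the coverage count (the values of M,
level and inside here are made-up numbers, just for illustration):

    from scipy import stats

    M, level, inside = 1000, 0.95, 941   # made-up example values

    # under correct coverage, inside is Binomial(M, level); an acceptance
    # band for the count at roughly the 1% significance level:
    lo = stats.binom.ppf(0.005, M, level)
    hi = stats.binom.ppf(0.995, M, level)
    print(lo <= inside <= hi)   # True if the observed coverage is consistent
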
>>
>> If you ignore the Bayesian belief interpretation, then it's just a
>> problem of Probability Theory, and you are just checking the
>> small and large sample behavior of an estimator and a test,
>> whether it has a Bayesian origin or not.
>
> Indeed. But with frequentist tests, I have a clear statement of what
> they're telling me that I can test against: "If you feed this test
> pure noise you'll get a result this high with probability p". I
> haven't figured out how to turn the p-value returned by this test into
> something I can test against.

What exactly are the null and the alternative hypotheses that you
want to test? This is still not clear to me; see also below.


>
>>> In any case, that seems to work, so now I just need to figure out a
>>> similar test for the probability of being pulsed.
>>
>> "probability of being pulsed"
>> I'm not sure what test you have in mind.
>> There are two interpretations:
>> In your current example, fraction is the fraction of observations that
>> are pulsed and fraction=0 is a zero probability event. So you cannot
>> really test fraction==0 versus fraction >0.
>>
>> In the other interpretation you would have a prior probability (mass)
>> that your star is a pulsar with fraction >0 or a non-pulsing unit
>> with fraction=0.
>
> This is what the code currently implements: I begin with a 50% chance
> the signal is unpulsed and a 50% chance the signal is pulsed with some
> fraction >= 0.

I don't see this.

In generate you have m = np.random.binomial(n, fraction),
where m is the number of pulsed observations.

The probability of observing no pulsed observations is very
small:
>>> stats.binom.pmf(0,100,0.05)
0.0059205292203340009

Your likelihood function pdf_data_given_model
also treats every observation as pulsed with the same probability (fraction).

I would have expected something like
case 1: pulsed according to the binomial fraction, same as now
case 2: no observations are pulsed
prior: Prob(case 1) = 0.5, Prob(case 2) = 0.5

Or am I missing something?
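
Roughly what I have in mind, as a sketch only (pulse_profile and prob_pulsed
are made-up stand-ins, not your actual functions; the flat prior on the
fraction and the crude grid integration are just placeholders):

    import numpy as np

    def pulse_profile(x, phase=0.3):
        # stand-in for the real pulse profile; normalized on [0, 1)
        return 1.0 + 0.5 * np.cos(2 * np.pi * (x - phase))

    def prob_pulsed(r, prior_case1=0.5, ngrid=201):
        # r: observed phases in [0, 1)
        f = np.linspace(0.0, 1.0, ngrid)    # grid for the fraction, flat prior
        # case 1: mixture f*profile + (1-f)*uniform, integrated over f
        loglike = np.log(f[:, None] * pulse_profile(r)[None, :]
                         + (1.0 - f[:, None])).sum(axis=1)
        mx = loglike.max()                   # max-shift, as in your code
        marg1 = np.exp(loglike - mx).mean()  # crude quadrature over f
        # case 2: all observations unpulsed, likelihood is exactly 1
        marg2 = np.exp(0.0 - mx)
        post1 = prior_case1 * marg1
        post2 = (1.0 - prior_case1) * marg2
        return post1 / (post1 + post2)

With purely uniform data the averaging over f acts as an Occam factor, so
case 2 should get the higher posterior probability; with a strongly pulsed
sample the ratio should move quickly toward case 1.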

Where is there a test? You have a good posterior distribution
for the fraction, which implies a point estimate and confidence interval,
and those look good from the tests.
But I don't see a test hypothesis (especially not a Bayesian statement of one).

Josef

>
>> The probabilities in both cases would be similar, but the interpretation
>> of the test would be different, and differ between frequentists and
>> Bayesians.
>>
>> Overall your results look almost too "nice"; with 8000 observations
>> you get a very narrow posterior in the plot.
>
> If you supply a fairly high pulsed fraction, it's indeed easy to tell
> that it's pulsed with 8000 photons; the difficulty comes when you're
> looking for a 10% pulsed fraction, which is much harder than 800 photons
> with a 100% pulsed fraction. If I were really interested in the
> many-photons case I'd want to think about a prior that made more sense
> for really small fractions. But I'm keeping things simple for now.
>
> Anne
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>

