[SciPy-User] Unit testing of Bayesian estimator

josef.pktd@gmai... josef.pktd@gmai...
Sun Nov 8 23:07:53 CST 2009


On Sun, Nov 8, 2009 at 10:22 PM, Anne Archibald
<peridot.faceted@gmail.com> wrote:
> 2009/11/8  <josef.pktd@gmail.com>:
>> On Sun, Nov 8, 2009 at 5:51 PM, Anne Archibald
>> <peridot.faceted@gmail.com> wrote:
>>> 2009/11/8  <josef.pktd@gmail.com>:
>>>
>>>> When I do a Monte Carlo for point estimates, I usually check bias,
>>>> variance, mean squared error,
>>>> and mean absolute and median absolute error (which are more
>>>> robust to outliers, e.g. because in some cases the estimator produces
>>>> numerical nonsense due to non-convergence or other numerical
>>>> problems). MSE better captures cases where a biased estimator is
>>>> better in the MSE sense.
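
For example, these summaries can be computed along the following lines
(a rough sketch; `estimates` and `true_value` are placeholder names for
the array of Monte Carlo point estimates and the parameter used in the
simulation):

import numpy as np

def point_estimate_summary(estimates, true_value):
    # summary statistics for a set of Monte Carlo point estimates
    estimates = np.asarray(estimates)
    err = estimates - true_value
    return {'bias': err.mean(),
            'variance': estimates.var(),
            'mse': (err**2).mean(),
            'mean_abs_error': np.abs(err).mean(),
            'median_abs_error': np.median(np.abs(err))}
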
>>>
>>> I can certainly compute all these quantities from a collection of
>>> Monte Carlo runs, but I don't have any idea what values would indicate
>>> correctness, apart from "not too big".
>>
>> I consider them mainly as an absolute standard to see how well
>> an estimator works (or what the size and power of a test is) or
>> to compare them to other estimators, which is a common case for
>> publishing Monte Carlo studies for new estimators.
>
> Ah. Yes, I will definitely be wanting to compute these at some point.
> But first I'd like to make sure this estimator is doing what I want it
> to.
>
>>> I could test more quantiles, but I'm very distrustful of testing more
>>> than one quantile per randomly-generated sample: they should be
>>> covariant (if the 90% mark is too high, the 95% mark will almost
>>> certainly be too high as well) and I don't know how to take that into
>>> account. And running the test is currently so slow I'm inclined to
>>> spend my CPU time on a stricter test of a single quantile. Though
>>> unfortunately to increase the strictness I also need to improve the
>>> sampling in phase and fraction.
>>
>> Adding additional quantiles might be relatively cheap, mainly the
>> call to searchsorted. One or two quantiles could be consistent
>> with many different distributions, e.g. ones with fatter tails, so I usually
>> check more points.
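
Checking several quantiles at once across runs could look roughly like
this (a sketch; `pit` is a placeholder for the per-run posterior CDF
values evaluated at the true parameter, collected over the Monte Carlo):

import numpy as np

def coverage_at_quantiles(pit, quantiles):
    # pit[i]: posterior CDF of run i evaluated at the true parameter value;
    # for a calibrated posterior these should be roughly uniform on [0, 1]
    pit = np.sort(np.asarray(pit))
    quantiles = np.asarray(quantiles)
    # fraction of runs in which the true value fell below each quantile
    return np.searchsorted(pit, quantiles, side='right') / float(pit.size)

# coverage_at_quantiles(pit, [0.5, 0.9, 0.95, 0.99]) should come out
# close to [0.5, 0.9, 0.95, 0.99]
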
>
> As I said, I'm concerned about using more than one credible interval
> per simulation run, since the credible intervals for different
> quantiles will be different sizes.
>
>>>> In each iteration of the Monte Carlo you get a full posterior distribution,
>>>> after a large number of iterations you have a sampling distribution,
>>>> and it should be possible to compare this distribution with the
>>>> posterior distributions. I'm still not sure how.
>>>
>>> I don't understand what you mean here. I do get a full posterior
>>> distribution out of every simulation. But how would I combine these
>>> different distributions, and what would the combined distribution
>>> mean?
>>
>> I'm still trying to think how this can be done. Checking more quantiles
>> as discussed above might be doing it to some extent.
>> (I also wonder whether it might be useful to fix the observations
>> during the Monte Carlo and only vary the sampling of the parameters?)
>
> I can see that fixing the data would be sort of nice, but it's not at
> all clear to me what it would even mean to vary the model while
> keeping the data constant - after all, the estimator has no access to
> the model, only the data, so varying the model would have no effect on
> the result returned.
>
>>> Indeed. But with frequentist tests, I have a clear statement of what
>>> they're telling me that I can test against: "If you feed this test
>>> pure noise you'll get a result this high with probability p". I
>>> haven't figured out how to turn the p-value returned by this test into
>>> something I can test against.
>>
>> What exactly are the null and the alternative hypothesis that you
>> want to test? This is still not clear to me, see also below.
>
> Null hypothesis: no pulsations, all photons are drawn from a uniform
> distribution.
>
> Alternative: photons are drawn from a distribution with pulsed
> fraction f and phase p.
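
For concreteness, the two data-generating processes could be sketched
like this (not the actual generate() function; the Gaussian pulse shape
below is only a placeholder for whatever pulse profile is really used):

import numpy as np

def generate_phases(n, fraction, phase):
    # null hypothesis: fraction = 0, so all phases are uniform on [0, 1)
    # alternative: a binomial number m of the n photons is pulsed
    m = np.random.binomial(n, fraction)
    pulsed = (phase + 0.05 * np.random.standard_normal(m)) % 1.0
    unpulsed = np.random.uniform(0.0, 1.0, n - m)
    return np.concatenate([pulsed, unpulsed])
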
>
>>> This is what the code currently implements: I begin with a 50% chance
>>> the signal is unpulsed and a 50% chance the signal is pulsed with some
>>> fraction >= 0.
>>
>> I don't see this
>>
>> in generate you have m = np.random.binomial(n, fraction)
>> where m is the number of pulsed observations.
>
> Generate can be used to generate photons from a uniform distribution
> by calling it with fraction set to zero. I don't actually do this
> while testing the credible intervals, because  (as I understand it)
> the presence of this hypothesis does not affect the credible
> intervals. That is, the credible intervals I'm testing are the
> credible intervals assuming that the pulsations are real. I'm not at
> all sure how to incorporate the alternative hypothesis into my
> testing.
>
>> the probability of observing no pulsed observations is very
>> small:
>> >>> stats.binom.pmf(0, 100, 0.05)
>> 0.0059205292203340009
>>
>> your likelihood function pdf_data_given_model
>> also treats each observation as pulsed with the same probability (the fraction).
>
> pdf_data_given_model computes the PDF given a set of model parameters.
> If you want to use it to get a likelihood with fraction=0, you can
> call it with fraction=0. But this likelihood is always zero.
>
>> I would have expected something like
>> case1: pulsed according to binomial fraction, same as now
>> case2: no observations are pulsed
>> prior Prob(case1)=0.5, Prob(case2)=0.5
>>
>> Or am I missing something?
>>
>> Where is there a test? You have a good posterior distribution
>> for the fraction, which implies a point estimate and confidence interval,
>> and those look good in the tests.
>> But I don't see a test of a hypothesis (especially not a Bayesian statement).
>
> When I came to implement it, the only place I actually needed to
> mention the null hypothesis was in the calculation of the pulsed
> probability, which is the last value returned by the inference
> routine. I did make the somewhat peculiar choice to have the model PDF
> returned normalized so that its total probability was one, rather than
> scaling all points by the probability that the system is pulsed at
> all.
>
> It turns out that the inference code just computes the total
> normalization S over all parameters; then the probability that the
> signal is pulsed is S/(S+1).

Ok, I think I'm starting to see how this works: since you drop the
prior probabilities (0.5, 0.5) and the likelihood under the uniform
distribution is just 1, everything is in a fairly reduced form.

From the posterior probability S/(S+1), you could construct
a decision rule similar to a classical test, e.g. accept the null
if S/(S+1) < 0.95, and then run a Monte Carlo
with samples drawn from either the uniform or the pulsed
distribution, in the same way as for a classical test, and
verify that the rates of decision mistakes (alpha and beta errors)
in the sample are consistent with the posterior probabilities.
The posterior probability would play a role similar to the p-value
in a classical test. If you want to balance alpha and
beta errors, a threshold S/(S+1) < 0.5 would be more
appropriate, but for the unit tests it wouldn't matter.
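
A rough sketch of that Monte Carlo (infer_S and generate here are
stand-ins for the actual inference and simulation functions, so the
names and signatures are assumptions):

import numpy as np

def says_pulsed(infer_S, phases, threshold=0.95):
    # decision rule: claim "pulsed" if the posterior probability
    # S/(S+1) is at or above the threshold
    S = infer_S(phases)
    return S / (S + 1.0) >= threshold

def error_rates(infer_S, generate, n, fraction, phase,
                n_runs=1000, threshold=0.95):
    # alpha: rate of claiming "pulsed" on purely uniform data (fraction = 0)
    alpha = np.mean([says_pulsed(infer_S, generate(n, 0.0, phase), threshold)
                     for _ in range(n_runs)])
    # beta: rate of missing a real pulsation at the given fraction
    beta = np.mean([not says_pulsed(infer_S, generate(n, fraction, phase), threshold)
                    for _ in range(n_runs)])
    return alpha, beta
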

Running the example a few times, it looks like the power
for distinguishing a uniform distribution from
a pulsed distribution with fraction/binomial parameter 0.05
is relatively low for sample sizes below 1000.
If you have strong beliefs that the fraction really is this low,
then an informative prior for the fraction might improve the
results.
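
For example, assuming the posterior is computed on a grid over the
fraction (the grid and the current flat weighting are assumptions about
the implementation), the flat prior could be replaced by something like
a Beta(1, 19), which has mean 0.05 and puts most of its mass on small
fractions:

import numpy as np
from scipy import stats

frac = np.linspace(0.0, 1.0, 201)        # hypothetical grid over the pulsed fraction
flat_prior = np.ones_like(frac)          # implicit flat prior
informative_prior = stats.beta.pdf(frac, 1.0, 19.0)  # mean 0.05, concentrated near 0
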

Josef


>
> Anne
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>

