[SciPy-user] problems with stats
Robert Kern
rkern at ucsd.edu
Tue Apr 27 17:17:54 CDT 2004
danny shevitz wrote:
> The short version of the problem that I was trying to solve is: Given N
> trials and n events, what is the specified confidence interval for the
> binomial probability p.
>
> The hypothetical argument call should therefore take 2 integers and a
> probability, e.g. f(10, 10000000, .95) The third number is a
> confidence, not the probability in the binomial distribution. Of course
> there would have to be a pair of such functions ( one for the lower
> bound, and one for the upper).
What you are looking for is the beta distribution. Here's the Bayesian
analysis:
We have n Bernoulli trials and r successes with an unknown chance
(theta) of success on any given trial. The likelihood is
P(n,r|theta) = theta**r * (1-theta)**(n-r)
The posterior probability is
P(theta|n,r) = A * P(n,r|theta) * P(theta)
A is a normalizing constant and P(theta) is your prior probability for
theta. For now, let's use a minimally informative prior in order to make
the fewest assumptions.
P(theta) o< 1/(theta*(1-theta))
The o< symbol is the proportionality symbol. This prior is improper and
cannot be normalized as it stands. When we put it together with the
likelihood, as long as r does not equal either 0 or n, the posterior
will come out normalized. There are good reasons which I won't go into
now to use this instead of P(theta)=1, but you can use that instead and
redo the math that leads to the following (easy!).
So, P(theta|n,r) = A * theta**(r-1) * (1-theta)**(n-r-1)
We can recognize this as the Beta distribution, Beta(r,n-r) (or
Beta(r+1,n-r+1) if you're queasy about improper priors).
special.bdtri(10, 10000000, 0.95) will give you the one-sided 95%
credible interval (like a confidence interval, only Bayesian). Since
this distribution isn't symmetric, I'm not entirely sure how to get the
highest posterior density (HPD) credible interval around the peak. And
I'm not entirely sure how accurate the function is with those extreme
inputs.
This analysis is lifted straight from Chapter 6 of the excellent book
_Probability Theory: The Logic of Science_ by E. T. Jaynes (Cambridge
Press, 2003).
> Like I said, what I was doing, wasn't particularly well thought out. I
> was just playing around, and using the function wrong anyway...
>
> BTW I guess I have no idea what binom.ppf is supposed to do then. Why
> do you need two probilities? Is one a confidence and the other the
> probability.
The random variable in the binomial distribution is the number of
successes given the probability p of individual success and the number
of trials. The point mass function would invert for n, the number of
successes that, by summing the probability masses for each number of
successes between 0 and n, gives you 0.95 (for example). So yes, one
input is the confidence level, one is the probability of individual
success, and you also need the number of trials.
> Also in case it wasn't obvious, I'm not a statistician by training :-)
>
> D
--
Robert Kern
rkern at ucsd.edu
"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
More information about the SciPy-user
mailing list