[SciPy-user] problems with stats

Robert Kern rkern at ucsd.edu
Tue Apr 27 17:17:54 CDT 2004


danny shevitz wrote:

> The short version of the problem that I was trying to solve is: Given N
> trials and n events, what is the specified confidence interval for the
> binomial probability p?
> 
> The hypothetical argument call should therefore take 2 integers and a
> probability, e.g. f(10, 10000000, .95). The third number is a
> confidence level, not the probability in the binomial distribution. Of
> course there would have to be a pair of such functions (one for the
> lower bound, and one for the upper).

What you are looking for is the beta distribution. Here's the Bayesian 
analysis:

We have n Bernoulli trials and r successes with an unknown chance 
(theta) of success on any given trial. The likelihood (dropping the 
binomial coefficient, which doesn't depend on theta and gets absorbed 
into the normalizing constant below) is

P(n,r|theta) = theta**r * (1-theta)**(n-r)

The posterior probability is

P(theta|n,r) = A * P(n,r|theta) * P(theta)

A is a normalizing constant and P(theta) is your prior probability for 
theta. For now, let's use a minimally informative prior in order to make 
the fewest assumptions.

P(theta) o< 1/(theta*(1-theta))

The o< symbol is the proportionality symbol. This prior is improper and 
cannot be normalized as it stands. When we put it together with the 
likelihood, as long as r does not equal either 0 or n, the posterior 
will come out normalized. There are good reasons, which I won't go into 
now, for using this instead of P(theta)=1, but you can use the uniform 
prior instead and redo the (easy!) math that leads to the following.

So, P(theta|n,r) = A * theta**(r-1) * (1-theta)**(n-r-1)

We can recognize this as the Beta distribution, Beta(r,n-r) (or 
Beta(r+1,n-r+1) if you're queasy about improper priors).
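
If you want to poke at this numerically, here's a minimal sketch using 
the beta distribution in scipy.stats (the numbers n and r are made up 
for illustration; check the calls against your scipy version):

from scipy import stats

n, r = 20, 7    # made-up numbers of trials and successes

# Posterior under the improper prior: Beta(r, n-r).
# Use (r+1, n-r+1) instead for the uniform prior.
print(stats.beta.ppf(0.5, r, n - r))    # posterior median of theta
print(stats.beta.pdf(0.35, r, n - r))   # posterior density at theta=0.35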

special.bdtri(10, 10000000, 0.95) will give you the endpoint of a 
one-sided 95% credible interval (like a confidence interval, only 
Bayesian); it inverts the binomial CDF with respect to p. Since this 
distribution isn't symmetric, I'm not entirely sure how to get the 
highest posterior density (HPD) credible interval around the peak. And 
I'm not entirely sure how accurate the function is with those extreme 
inputs.
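
One hedged way around both worries is to go straight at the Beta 
quantile function and do a little one-dimensional search for the HPD 
interval. This is only a sketch, and the tail-mass parameterization is 
my own convenience:

from scipy import stats, optimize

n, r = 10000000, 10
a, b = r, n - r    # Beta(r, n-r); use (r+1, n-r+1) for the uniform prior

# One-sided 95% bounds straight from the Beta quantile function:
lower = stats.beta.ppf(0.05, a, b)   # theta > lower with posterior mass 0.95
upper = stats.beta.ppf(0.95, a, b)   # theta < upper with posterior mass 0.95

# Crude HPD sketch: of all intervals holding 95% of the posterior mass,
# take the shortest, parameterized by the mass x left in the lower tail.
def width(x):
    return stats.beta.ppf(x + 0.95, a, b) - stats.beta.ppf(x, a, b)

x0 = optimize.fminbound(width, 0.0, 0.05)
print(lower, upper)
print(stats.beta.ppf(x0, a, b), stats.beta.ppf(x0 + 0.95, a, b))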

This analysis is lifted straight from Chapter 6 of the excellent book 
_Probability Theory: The Logic of Science_ by E. T. Jaynes (Cambridge 
Press, 2003).

> Like I said, what I was doing wasn't particularly well thought out. I
> was just playing around, and using the function wrong anyway...
> 
> BTW I guess I have no idea what binom.ppf is supposed to do then. Why
> do you need two probabilities? Is one a confidence and the other the
> probability?

The random variable in the binomial distribution is the number of 
successes, given the probability p of individual success and the number 
of trials. The percent point function (the inverse of the cumulative 
distribution function) inverts for k, the smallest number of successes 
such that summing the probability masses for each number of successes 
between 0 and k gives you at least 0.95 (for example). So yes, one 
input is the confidence level, another is the probability of individual 
success, and you also need the number of trials.
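
In code, with made-up numbers (q is the confidence level, p the 
per-trial success probability):

from scipy import stats

n, p, q = 10, 0.3, 0.95    # made-up trial count and success probability
k = stats.binom.ppf(q, n, p)
# k is the smallest success count whose cumulative mass reaches q:
assert stats.binom.cdf(k, n, p) >= q
print(k)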

> Also in case it wasn't obvious, I'm not a statistician by training :-)
> 
> D  

-- 
Robert Kern
rkern at ucsd.edu

"In the fields of hell where the grass grows high
  Are the graves of dreams allowed to die."
   -- Richard Harter


