[SciPy-User] Scipy and statistics: probability density function
Robert Kern
robert.kern@gmail....
Mon Jul 27 10:59:36 CDT 2009
On Mon, Jul 27, 2009 at 02:33, Daniel J
Farrell<daniel.farrell@imperial.ac.uk> wrote:
> Dear list,
>
> I am looking for some of the functionality provided by the GNU
> Scientific Library histogram module (http://www.gnu.org/software/gsl/manual/html_node/Histograms.html
> ).
>
> In particular, a need to be able to create a probability density
> function from my histogram of data. This will allow inverse look-ups
> to be performed, i.e. for a random number (0-->1) find the associated
> probability (http://www.gnu.org/software/gsl/manual/html_node/The-histogram-probability-distribution-struct.html
> ). This allows the distribution and for samples to be returned
> weighted by the probability of the distribution -- which is a common
> task!
It looks like you want a CDF (or rather it's inverse, the PPF) rather
than a PDF. Anyways, this is straightforward. Compute the histogram
using normed=False. Find the cumulative sum and divide by the sum to
get the (smoothed) empirical CDF. Prepend a 0.0 to this, and then this
will align with the edges array that is also returned. Then you can
use linear interpolation to do lookups. If you use the edges array as
"X" and the empirical CDF as "Y", then this is a CDF. If you use the
empircal CDF array as "X" and the edges as "Y", then this is a PPF.
In [12]: x = np.random.uniform(0, 10, size=1000)
In [13]: hist, edges = np.histogram(x)
In [15]: hist
Out[15]: array([100, 101, 104, 108, 80, 111, 96, 88, 108, 104])
In [16]: edges
Out[16]:
array([ 3.53879571e-04, 1.00026423e+00, 2.00017458e+00,
3.00008493e+00, 3.99999528e+00, 4.99990563e+00,
5.99981598e+00, 6.99972633e+00, 7.99963668e+00,
8.99954704e+00, 9.99945739e+00])
In [18]: ecdf = np.hstack([0.0, hist.cumsum() / float(hist.sum())])
In [19]: ecdf
Out[19]:
array([ 0. , 0.1 , 0.201, 0.305, 0.413, 0.493, 0.604, 0.7 ,
0.788, 0.896, 1. ])
In [20]: np.interp(np.linspace(0, 10), edges, ecdf)
Out[20]:
array([ 0. , 0.0203746 , 0.04078459, 0.06119459, 0.08160458,
0.10203472, 0.12264881, 0.14326291, 0.163877 , 0.18449109,
0.20522712, 0.22645351, 0.24767991, 0.2689063 , 0.29013269,
0.31160366, 0.33364646, 0.35568925, 0.37773204, 0.39977483,
0.41953158, 0.43585957, 0.45218756, 0.46851556, 0.48484355,
0.50433802, 0.52699311, 0.54964821, 0.5723033 , 0.59495839,
0.61577382, 0.63536742, 0.65496101, 0.6745546 , 0.6941482 ,
0.71259664, 0.73055743, 0.74851823, 0.76647902, 0.78443982,
0.80567348, 0.82771627, 0.84975906, 0.87180185, 0.89384465,
0.91515087, 0.93637726, 0.95760365, 0.97883004, 1. ])
In [21]: np.interp(np.linspace(0, 1), ecdf, edges)
Out[21]:
array([ 3.53879571e-04, 2.04417216e-01, 4.08480553e-01,
6.12543890e-01, 8.16607227e-01, 1.02046852e+00,
1.22251143e+00, 1.42455434e+00, 1.62659724e+00,
1.82864015e+00, 2.02980301e+00, 2.22601775e+00,
2.42223250e+00, 2.61844725e+00, 2.81466200e+00,
3.01047705e+00, 3.19942458e+00, 3.38837211e+00,
3.57731965e+00, 3.76626718e+00, 3.95521472e+00,
4.19462069e+00, 4.44969986e+00, 4.70477903e+00,
4.95985820e+00, 5.15488346e+00, 5.33872431e+00,
5.52256515e+00, 5.70640600e+00, 5.89024684e+00,
6.08569264e+00, 6.29825861e+00, 6.51082459e+00,
6.72339057e+00, 6.93595654e+00, 7.16204944e+00,
7.39393960e+00, 7.62582975e+00, 7.85771991e+00,
8.07294833e+00, 8.26189586e+00, 8.45084340e+00,
8.63979093e+00, 8.82873846e+00, 9.01838365e+00,
9.21459840e+00, 9.41081315e+00, 9.60702789e+00,
9.80324264e+00, 9.99945739e+00])
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
More information about the SciPy-User
mailing list