[Numpy-discussion] Statistical distributions on samples

alan@ajackso... alan@ajackso...
Fri Aug 19 15:01:25 CDT 2011


I have applied the update to the documentation (although that function
needs a general rewrite - later...)

>On Mon, Aug 15, 2011 at 8:53 AM, Andrea Gavana <andrea.gavana@gmail.com>wrote:
>
>> Hi Chris and All,
>>
>> On 12 August 2011 16:53, Christopher Jordan-Squire wrote:
>> > Hi Andrea--An easy way to get something like this would be
>> >
>> > import numpy as np
>> > import scipy.stats as stats
>> >
>> > sigma = ...  # some reasonable standard deviation for your application
>> > x = stats.norm.rvs(size=1000, loc=125, scale=sigma)
>> > x = x[x > 50]
>> > x = x[x < 200]
>> >
>> > That will give a roughly normal distribution to your velocities, as
>> > long as, say, sigma < 25. (I'm using the rule of thumb for the normal
>> > distribution that normal random samples lie 3 standard deviations away
>> > from the mean about 1 out of 350 times.) Though you won't be able to
>> > get exactly normal errors about your mean, since normal random samples
>> > can theoretically take any value.
>> >
>> > You can use this same process for any other distribution, as long as
>> > you've chosen a scale variable so that the probability of samples being
>> > outside your desired interval is really small. Of course, once again,
>> > your random errors won't be exactly from the distribution you drew your
>> > original samples from.
>>
>> Thank you for your suggestion. There are a couple of things I am not
>> clear on, however. The first (the easy one) is: let's suppose I
>> need 200 values, and the accept/discard procedure removes 5 of them
>> from the list. Is there any way to draw these 200 values from a bigger
>> sample so that the accept/reject procedure will not interfere too
>> much? And how do I get 200 values out of the bigger sample so that
>> these values are still representative?
>>
>
>FWIW, I'm not really advocating a truncated normal so much as making the
>standard deviation small enough so that there's no real difference between a
>true normal distribution and a truncated normal.
>
>If you're worried about getting exactly 200 samples, then you could sample N
>with N>200 and such that after throwing out the ones that lie outside your
>desired region you're left with M>200. Then just randomly pick 200 from
>those M. That shouldn't bias anything as long as you randomly pick them. (Or
>just pick the first 200, if you haven't done anything to impose any order on
>the samples, such as sorting them by size.) But I'm not sure why you'd want
>exactly 200 samples instead of some number of samples close to 200.
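[Editor's note: Chris's oversample-then-subsample suggestion can be sketched as follows. The 2x oversampling factor and sigma = 20 are assumptions; any sigma small enough that truncation is rare will do.]

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
sigma = 20  # assumed scale, small enough that truncation is rare

# Draw more samples than needed, truncate to the desired interval,
# then randomly keep exactly 200 of the survivors.
x = stats.norm.rvs(size=400, loc=125, scale=sigma, random_state=rng)
x = x[(x > 50) & (x < 200)]
sample = rng.choice(x, size=200, replace=False)
```

Picking the final 200 with `rng.choice(..., replace=False)` introduces no bias, since every surviving sample is equally likely to be kept.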
>
>
>>
>> Another thing, possibly completely unrelated. I am trying to design a
>> toy Latin Hypercube script (just for my own understanding). I found
>> this piece of code on the web (and I modified it slightly):
>>
>> import numpy
>> from scipy import stats
>>
>> def lhs(dist, size=100):
>>    '''
>>    Latin Hypercube sampling of any distribution.
>>    dist is a scipy.stats random number generator
>>    such as stats.norm, stats.beta, etc.
>>
>>    :Parameters:
>>        - `dist`: frozen distribution from the scipy.stats module.
>>        - `size`: size of the output sample.
>>    '''
>>
>>    n = size
>>
>>    # n evenly spaced stratum boundaries on [0, 1), shuffled
>>    perc = numpy.arange(0.0, 1.0, 1.0/n)
>>    numpy.random.shuffle(perc)
>>
>>    # one uniform draw inside each stratum [i, i + 1/n)
>>    smp = [stats.uniform(i, 1.0/n).rvs() for i in perc]
>>
>>    # map the stratified probabilities through the quantile function
>>    v = dist.ppf(smp)
>>
>>    return v
>>
>>
>> Now, I am not 100% clear of what the percent point function is (I have
>> read around the web, but please keep in mind that my statistical
>> skills are close to minus infinity). From this page:
>>
>> http://www.itl.nist.gov/div898/handbook/eda/section3/eda362.htm
>>
>>
>The ppf is what's called the quantile function elsewhere. I do not know why
>scipy calls it the ppf/percent point function.
>
>The quantile function is the inverse of the cumulative distribution function
>(cdf). So dist.ppf(z) is the x such that P(dist <= x) = z. Roughly. (Things
>get slightly more finicky if you think about discrete distributions because
>then you have to pick what happens at the jumps in the cdf.) So
>dist.ppf(0.5) gives the median of dist, and dist.ppf(0.25) gives the
>lower/first quartile of dist.
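[Editor's note: a quick check of ppf as the inverse cdf, using the N(125, 25) distribution that appears later in the thread.]

```python
import scipy.stats as stats

dist = stats.norm(loc=125, scale=25)

# ppf is the inverse of the cdf: cdf(ppf(z)) == z for z in (0, 1).
median = dist.ppf(0.5)   # 125.0, the median (and mean) of this normal
q1 = dist.ppf(0.25)      # lower/first quartile, roughly 108.1
round_trip = dist.cdf(dist.ppf(0.9))
print(median, q1, round_trip)
```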
>
>
>> I gather that, if you plot the results of the ppf, with the horizontal
>> axis as probability, the vertical axis goes from the smallest to the
>> largest value of the cumulative distribution function. If I do this:
>>
>> numpy.random.seed(123456)
>>
>> distribution = stats.norm(loc=125, scale=25)
>>
>> my_lhs = lhs(distribution, 50)
>>
>> Will my_lhs always contain valid values (i.e., included between 50 and
>> 200)? I assume the answer is no... but even if this was the case, is
>> this my_lhs array ready to be used to setup a LHS experiment when I
>> have multi-dimensional problems (in which all the variables are
>> completely independent from each other - no correlation)?
>>
>>
>I'm not really sure if the above function is doing the lhs you want. To
>answer your question, it won't always generate values within [50,200]. If
>size is large enough then you're dividing up the probability space evenly.
>So even with the random perturbations (whose use I don't really understand),
>you'll ensure that some of the samples you get when you apply the ppf will
>correspond to the extremely low probability samples that are <50 or >200.
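[Editor's note: the stratification inside lhs() can be written directly, which makes Chris's point concrete. The jitter places one point uniformly in each probability stratum [i/n, (i+1)/n); the lowest stratum spans [0, 1/50), which contains the region below 50 (P(X < 50) is about 0.0013 for N(125, 25)), so a run can produce values outside [50, 200].]

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(123456)
n = 50
dist = stats.norm(loc=125, scale=25)

# One jittered point per probability stratum [i/n, (i+1)/n), shuffled:
# this is the stratification that lhs() above performs.
perc = (np.arange(n) + rng.uniform(size=n)) / n
rng.shuffle(perc)
v = dist.ppf(perc)

print(v.min(), v.max())
```

Each stratum is hit exactly once, so the sample spreads evenly over the probability space, but the extreme strata can still map to very small or very large velocities.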
>
>-Chris JS
>
>> My apologies for the idiocy of the questions.
>>
>> Andrea.
>>
>> "Imagination Is The Only Weapon In The War Against Reality."
>> http://xoomer.alice.it/infinity77/
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>


-- 
-----------------------------------------------------------------------
| Alan K. Jackson            | To see a World in a Grain of Sand      |
| alan@ajackson.org          | And a Heaven in a Wild Flower,         |
| www.ajackson.org           | Hold Infinity in the palm of your hand |
| Houston, Texas             | And Eternity in an hour. - Blake       |
-----------------------------------------------------------------------

