# [SciPy-Dev] distributions.py

nicky van foreest vanforeest@gmail....
Sun Sep 16 14:36:33 CDT 2012

> I looked in the past at how the conditions are built, and I gave up
> trying to unify them after a short time.
> pdf is zero outside of support
> cdf, sf is zero or one outside of support
> ppf, isf produces nan if not in [0,1]
>
> boundary points are either included or treated explicitly
> all produce nan if shape parameter is invalid.
>
> reading the conditions for all corner cases might cause headaches :)

The more I think about it, the more I tend to agree. There are many
distributions, with lots of properties. However, there may be some
simple patterns that the conditions have in common. I'll check this
after I have completed work on the documentation and the split of
distributions.py.
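
To make the conventions listed above concrete, here is a minimal sketch
using a toy exponential distribution written in plain numpy. The names
toy_expon_pdf and toy_expon_ppf are made up for illustration; this is
not the code in distributions.py, only one way the stated rules (zero
outside the support, nan for invalid arguments, boundary points handled
explicitly) could be expressed elementwise.

```python
import numpy as np


def toy_expon_pdf(x, scale=1.0):
    """Toy exponential pdf on [0, inf); illustrative, not scipy's code."""
    x, scale = np.broadcast_arrays(np.atleast_1d(np.asarray(x, dtype=float)),
                                   np.asarray(scale, dtype=float))
    out = np.full(x.shape, np.nan)       # nan wherever the parameter is invalid
    valid = scale > 0                    # elementwise parameter check
    out[valid & (x < 0)] = 0.0           # pdf is zero outside the support
    inside = valid & (x >= 0)
    out[inside] = np.exp(-x[inside] / scale[inside]) / scale[inside]
    return out


def toy_expon_ppf(q, scale=1.0):
    """Toy exponential ppf; nan if q is not in [0, 1]."""
    q, scale = np.broadcast_arrays(np.atleast_1d(np.asarray(q, dtype=float)),
                                   np.asarray(scale, dtype=float))
    out = np.full(q.shape, np.nan)       # nan for q outside [0, 1] or bad scale
    valid = scale > 0
    ok = valid & (q >= 0) & (q < 1)
    out[ok] = -scale[ok] * np.log1p(-q[ok])
    out[valid & (q == 1)] = np.inf       # boundary point treated explicitly
    return out


print(toy_expon_pdf([-1.0, 0.0, 1.0]))
print(toy_expon_ppf([0.0, 0.5, 1.0, 1.5]))
print(toy_expon_pdf([1.0, 1.0], scale=[1.0, -1.0]))
```

Even in this toy case each method needs its own masks, which may be part
of why unifying the conditions across all distributions is hard.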

>> 3: the docs say that _argcheck needs to be rewritten in case users
>> build their own distribution. But then the minimal requirement in my
>> opinion is that _argcheck is simple to understand, and not overly
>> generic as it is right now. (I also have examples where its output,
>> while in line with its docstring, results in errors.) As far as I can
>> see its core can simply be replaced by np.all(cond) (I did not test
>> this though).
>
> np.all(cond)  will not work
>
> from code comment:
> "Returns condition array of 1's where arguments are correct and 0's
> where they are not."
>
> _argcheck is *elementwise* check for valid parameters
> furthermore, in some cases _argcheck needs to set a, b if those depend
> on shape parameters.
>
>
> no def __init__()
>

I'll save this comment in my todo list, and turn to it later.
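
For the record, a small sketch of the difference Josef points out
between an elementwise check and np.all(cond). The function names below
are hypothetical and the check "shape > 0" is just an assumed example
condition; the real _argcheck differs per distribution.

```python
import numpy as np


def argcheck_elementwise(shape):
    """Hypothetical check: one boolean per parameter value."""
    return np.asarray(shape) > 0


def argcheck_collapsed(shape):
    """Same condition reduced with np.all: a single boolean for everything."""
    return np.all(np.asarray(shape) > 0)


shapes = np.array([2.0, -1.0, 0.5])

elementwise = argcheck_elementwise(shapes)   # keeps track of which entry is bad
collapsed = argcheck_collapsed(shapes)       # one bad entry rejects all of them

# With the elementwise mask, valid entries still get a value and only the
# invalid one becomes nan.
result = np.where(elementwise, shapes, np.nan)

print(elementwise)
print(collapsed)
print(result)
```

With the collapsed version a single invalid parameter would invalidate
the results for all other (valid) parameters in the same call, so the
mask has to stay elementwise.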

Nicky

> Josef
>
>>
>> 4: distributions.py is very big, too big for me actually. I recall
>> that my first attempt at finding out how the stats stuff worked was to
>> see how expon was implemented. No clue that this resided in
>> distributions.py.
>>
>> What I would like to see, although that would require a considerable
>> amount of work, is an architecture like this.
>> 1) rv_generic.py containing generic stuff
>> 2) rv_continuous.py and rv_discrete.py, each imports rv_generic.
>> 3) each distribution is covered in a separate file, like expon.py,
>> norm.py, etc., and imports rv_continuous.py or rv_discrete.py,
>> whichever is appropriate. Each docstring can/should contain some generic
>> part (like now) and a specific part, with working examples, and clear
>> explanations. The most important are normal, expon, binom, geom,
>> poisson, and perhaps some others. This would also enable others to
>> help extend the documentation, examples....
>> 4) I would like to move the math parts in continuous.rst to the doc
>> string in the related distribution file.  Since mathjax gives such
>> nice results on screen, there is also no reason not to include the
>> mathematical facts in the doc string of the distribution itself. In
>> fact, most (all?) distributions already have a short math description,
>> but this is in overlap with continuous.rst.
>>
>> I wouldn't mind chopping up distributions.py into the separate
>> distributions, and merge it with the maths of continuous.rst. I can
>> tackle approx one distribution per day roughly, hence reduce this
>> mind-numbing work to roughly 15 minutes a day (correction work on
>> exams is much worse :-) ). But I don't know how much this proposal
>> will affect the automatic generation of documentation. For the rest I
>> don't think this will affect the code a lot.
>>
>>
>>
>> Nicky
>>
>>
>>
>>
>>
>> On 15 September 2012 11:59, Ralf Gommers <ralf.gommers@gmail.com> wrote:
>>>
>>>
>>> On Fri, Sep 14, 2012 at 10:56 PM, Jake Vanderplas
>>> <vanderplas@astro.washington.edu> wrote:
>>>>
>>>> On 09/14/2012 01:49 PM, Ralf Gommers wrote:
>>>>
>>>>
>>>>
>>>> On Fri, Sep 14, 2012 at 12:48 AM, <josef.pktd@gmail.com> wrote:
>>>>>
>>>>> On Thu, Sep 13, 2012 at 5:21 PM, nicky van foreest <vanforeest@gmail.com>
>>>>> wrote:
>>>>> > Hi,
>>>>> >
>>>>> > Now that I understand github (Thanks to Ralf for his explanations in
>>>>> > Dutch) and got some simple stuff out of the way in distributions.py I
>>>>> > would like to tackle a somewhat harder issue. The function argsreduce
>>>>> > is, as far as I can see, too generic. I did some tests to see whether
>>>>> > its most generic output, as described by its docstring, is actually
>>>>> > swallowed by the callers of argsreduce, but this appears not to be the
>>>>> > case.
>>>>>
>>>>> being generic is not a disadvantage (per se) if it's fast
>>>>>
>>>>> https://github.com/scipy/scipy/commit/4abdc10487d453b56f761598e8e013816b01a665
>>>>> (and being a one-liner is not a disadvantage either)
>>>>>
>>>>> Josef
>>>>>
>>>>> >
>>>>> > My motivation to simplify the code in distributions.py (and clean it
>>>>> > up) is partly based on making it simpler to understand for myself, but
>>>>> > also for others. Since github makes code browsing a much nicer
>>>>> > experience, perhaps more people will take a look at what's under the
>>>>> > hood. But then the code should also be accessible and clean. Are there
>>>>> > any reasons not to pursue this path, and focus on more important
>>>>> > problems of the stats library?
>>>>
>>>>
>>>> Not sure that argsreduce is the best place to start (see Josef's reply),
>>>> but there should be things that can be done to make the code easier to read.
>>>> For example, this code is used in ~10 methods of rv_continuous:
>>>>
>>>>         loc,scale=map(kwds.get,['loc','scale'])
>>>>         args, loc, scale = self._fix_loc_scale(args, loc, scale)
>>>>         x,loc,scale = map(asarray,(x,loc,scale))
>>>>         args = tuple(map(asarray,args))
>>>>
>>>> Some refactoring may be in order. The same is true of the rest of the
>>>> implementation of many of those methods. Some are exactly the same except
>>>> for calls to the corresponding underscored method (example: logsf() and
>>>> logcdf() are identical except for calls to _logsf() and _logcdf(), and one
>>>> nonsensical multiplication).
>>>>
>>>> Ralf
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> SciPy-Dev mailing list
>>>> SciPy-Dev@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>>
>>>> I would say that the most important improvement needed in distributions is
>>>> in the documentation.
>>>>
>>>> A new user would look at the doc string of, say, scipy.stats.norm, and
>>>> have no idea how to proceed.  Here's the current example from the docstring
>>>> of scipy.stats.norm:
>>>>
>>>> Examples
>>>> --------
>>>> >>> from scipy.stats import norm
>>>> >>> numargs = norm.numargs
>>>> >>> [  ] = [0.9,] * numargs
>>>> >>> rv = norm()
>>>>
>>>> >>> x = np.linspace(0, np.minimum(rv.dist.b, 3))
>>>> >>> h = plt.plot(x, rv.pdf(x))
>>>>
>>>> I don't even know what that means... and it doesn't compile.  Also, what
>>>> is b?  how would I enter mu and sigma to make a normal distribution?  It's
>>>> all pretty opaque.
>>>
>>>
>>> True, the examples are confusing. The reason is that they're generated from
>>> a template, and it's pretty much impossible to get clear and concise
>>> examples that way. It would be better to write custom examples for the
>>> most-used distributions, and refer to those from the others.
>>>
>>> Ralf
>>>
>>>
>>>