[SciPy-Dev] distributions.py

nicky van foreest vanforeest@gmail....
Sat Sep 15 16:03:46 CDT 2012


While reading distributions.py I made a kind of private trac list, of
stuff that might need refactoring, As a matter of fact, all issues
discussed in the mails above are already on my list. To summarize
(Please don't take the list below as a complaint, but just factual. I
am very happy that all this exists.)

1: the documentation is not clear, too concise, and fragmented;
actually a bit messy.

2: there is code overlap in the check work (The lines Ralf mentioned)
making it hard to find out the differences (but the differences in the
check work are method dependent so I don't quite know how to tackle
that in an elegant way),

3: the docs say that _argscheck need to be rewritten in case users
build their own distribution. But then the minimal requirement in my
opinion is that argscheck is simple to understand, and not overly
generic as it is right now. (I also have examples that its output,
while in line with its doc string, results in errors.) As far as I can
see its core can simply be replaced by np.all(cond) (I did not test
this though).

4: distributions.py is very big, too big for me actually. I recall
that my first attempt at finding out how the stats stuff worked was to
see how expon was implemented. No clue that this resided in

What I would like to see, although that would require a considerable
amount of work, is an architecture like this.
1 rv_generic.py containing generic stuff
2) rv_continous.py and rv_discrete.py, each imports rv_generic.
3) each distribution is covered in a separate file. like expon.py,
norm, py, etc, and imports rv_continuous.py or rv_discrete.py,
whatever appropriate. Each docstring can/should contain some generic
part (like now) and a specific part, with working examples, and clear
explanations. The most important are normal, expon, binom, geom,
poisson, and perhaps some others. This would also enable others to
help extend the documentation, examples....
4) I would like to move the math parts in continuous.rst to the doc
string in the related distribution file.  Since mathjax gives such
nice results on screen, there is also no reason not to include the
mathematical facts in the doc string of the distribution itself. In
fact, most (all?) distributions already have a short math description,
but this is in overlap with continuous.rst.

I wouldn't mind chopping up distributions.py into the separate
distributions, and merge it with the maths of continuous.rst. I can
tackle approx one distribution per day roughly, hence reduce this
mind-numbing work to roughly 15 minutes a day (correction work on
exams is much worse :-) ). But I don't know how much this proposal
will affect the automatic generation of documentation. For the rest I
don't think this will affect the code a lot.


On 15 September 2012 11:59, Ralf Gommers <ralf.gommers@gmail.com> wrote:
> On Fri, Sep 14, 2012 at 10:56 PM, Jake Vanderplas
> <vanderplas@astro.washington.edu> wrote:
>> On 09/14/2012 01:49 PM, Ralf Gommers wrote:
>> On Fri, Sep 14, 2012 at 12:48 AM, <josef.pktd@gmail.com> wrote:
>>> On Thu, Sep 13, 2012 at 5:21 PM, nicky van foreest <vanforeest@gmail.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > Now that I understand github (Thanks to Ralf for his explanations in
>>> > Dutch) and got some simple stuff out of the way in distributions.py I
>>> > would like to tackle a somewhat harder issue. The function argsreduce
>>> > is, as far as I can see, too generic. I did some tests to see whether
>>> > its most generic output, as described by its docstring, is actually
>>> > swallowed by the callers of argsreduce, but this appears not to be the
>>> > case.
>>> being generic is not a disadvantage (per se) if it's fast
>>> https://github.com/scipy/scipy/commit/4abdc10487d453b56f761598e8e013816b01a665
>>> (and a being a one liner is not a disadvantage either)
>>> Josef
>>> >
>>> > My motivation to simplify the code in distributions.py (and clean it
>>> > up) is partly based on making it simpler to understand for myself, but
>>> > also to  others. The fact that github makes code browsing a much nicer
>>> > experience, perhaps more people will take a look at what's under the
>>> > hood. But then the code should also be accessible and clean. Are there
>>> > any reasons not to pursue this path, and focus on more important
>>> > problems of the stats library?
>> Not sure that argsreduce is the best place to start (see Josef's reply),
>> but there should be things that can be done to make the code easier to read.
>> For example, this code is used in ~10 methods of rv_continuous:
>>         loc,scale=map(kwds.get,['loc','scale'])
>>         args, loc, scale = self._fix_loc_scale(args, loc, scale)
>>         x,loc,scale = map(asarray,(x,loc,scale))
>>         args = tuple(map(asarray,args))
>> Some refactoring may be in order. The same is true of the rest of the
>> implementation of many of those methods. Some are exactly the same except
>> for calls to the corresponding underscored method (example: logsf() and
>> logcdf() are identical except for calls to _logsf() and _logcdf(), and one
>> nonsensical multiplication).
>> Ralf
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>> I would say that the most important improvement needed in distributions is
>> in the documentation.
>> A new user would look at the doc string of, say, scipy.stats.norm, and
>> have no idea how to proceed.  Here's the current example from the docstring
>> of scipy.stats.norm:
>> Examples
>> --------
>> >>> from scipy.stats import norm
>> >>> numargs = norm.numargs
>> >>> [  ] = [0.9,] * numargs
>> >>> rv = norm()
>> >>> x = np.linspace(0, np.minimum(rv.dist.b, 3))
>> >>> h = plt.plot(x, rv.pdf(x))
>> I don't even know what that means... and it doesn't compile.  Also, what
>> is b?  how would I enter mu and sigma to make a normal distribution?  It's
>> all pretty opaque.
> True, the examples are confusing. The reason is that they're generated from
> a template, and it's pretty much impossible to get clear and concise
> examples that way. It would be better to write custom examples for the
> most-used distributions, and refer to those from the others.
> Ralf
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev

More information about the SciPy-Dev mailing list