[SciPy-Dev] distributions.py

Jake Vanderplas vanderplas@astro.washington....
Sun Sep 16 10:58:59 CDT 2012

This is great - thanks so much for being willing to take it on!  I will 
plan to help where I can.

One comment: I like your idea of splitting the generic code to a 
separate file.  But I'd hesitate to create a separate file for each 
distribution: that's a lot of files.  In my opinion, a good compromise 
would be to create one file for continuous distributions, and one for 
discrete.  All of this could be in a new "scipy.stats.distributions" 
submodule, for the sake of code organization.

Also, I'd add one more item to your list: make sure all code is PEP8 
compliant.  Sometimes the PEP8 guidelines can seem a bit cumbersome, but 
they do make browsing and understanding code much easier.

Thanks again for all your work on this - it's a very valuable contribution.

On 09/15/2012 02:03 PM, nicky van foreest wrote:
> Hi,
> While reading distributions.py I made a kind of private trac list, of
> stuff that might need refactoring, As a matter of fact, all issues
> discussed in the mails above are already on my list. To summarize
> (Please don't take the list below as a complaint, but just factual. I
> am very happy that all this exists.)
> 1: the documentation is not clear, too concise, and fragmented;
> actually a bit messy.
> 2: there is code overlap in the check work (The lines Ralf mentioned)
> making it hard to find out the differences (but the differences in the
> check work are method dependent so I don't quite know how to tackle
> that in an elegant way),
> 3: the docs say that _argscheck need to be rewritten in case users
> build their own distribution. But then the minimal requirement in my
> opinion is that argscheck is simple to understand, and not overly
> generic as it is right now. (I also have examples that its output,
> while in line with its doc string, results in errors.) As far as I can
> see its core can simply be replaced by np.all(cond) (I did not test
> this though).
> 4: distributions.py is very big, too big for me actually. I recall
> that my first attempt at finding out how the stats stuff worked was to
> see how expon was implemented. No clue that this resided in
> distributions.py.
> What I would like to see, although that would require a considerable
> amount of work, is an architecture like this.
> 1 rv_generic.py containing generic stuff
> 2) rv_continous.py and rv_discrete.py, each imports rv_generic.
> 3) each distribution is covered in a separate file. like expon.py,
> norm, py, etc, and imports rv_continuous.py or rv_discrete.py,
> whatever appropriate. Each docstring can/should contain some generic
> part (like now) and a specific part, with working examples, and clear
> explanations. The most important are normal, expon, binom, geom,
> poisson, and perhaps some others. This would also enable others to
> help extend the documentation, examples....
> 4) I would like to move the math parts in continuous.rst to the doc
> string in the related distribution file.  Since mathjax gives such
> nice results on screen, there is also no reason not to include the
> mathematical facts in the doc string of the distribution itself. In
> fact, most (all?) distributions already have a short math description,
> but this is in overlap with continuous.rst.
> I wouldn't mind chopping up distributions.py into the separate
> distributions, and merge it with the maths of continuous.rst. I can
> tackle approx one distribution per day roughly, hence reduce this
> mind-numbing work to roughly 15 minutes a day (correction work on
> exams is much worse :-) ). But I don't know how much this proposal
> will affect the automatic generation of documentation. For the rest I
> don't think this will affect the code a lot.
> NIcky
> On 15 September 2012 11:59, Ralf Gommers <ralf.gommers@gmail.com> wrote:
>> On Fri, Sep 14, 2012 at 10:56 PM, Jake Vanderplas
>> <vanderplas@astro.washington.edu> wrote:
>>> On 09/14/2012 01:49 PM, Ralf Gommers wrote:
>>> On Fri, Sep 14, 2012 at 12:48 AM, <josef.pktd@gmail.com> wrote:
>>>> On Thu, Sep 13, 2012 at 5:21 PM, nicky van foreest <vanforeest@gmail.com>
>>>> wrote:
>>>>> Hi,
>>>>> Now that I understand github (Thanks to Ralf for his explanations in
>>>>> Dutch) and got some simple stuff out of the way in distributions.py I
>>>>> would like to tackle a somewhat harder issue. The function argsreduce
>>>>> is, as far as I can see, too generic. I did some tests to see whether
>>>>> its most generic output, as described by its docstring, is actually
>>>>> swallowed by the callers of argsreduce, but this appears not to be the
>>>>> case.
>>>> being generic is not a disadvantage (per se) if it's fast
>>>> https://github.com/scipy/scipy/commit/4abdc10487d453b56f761598e8e013816b01a665
>>>> (and a being a one liner is not a disadvantage either)
>>>> Josef
>>>>> My motivation to simplify the code in distributions.py (and clean it
>>>>> up) is partly based on making it simpler to understand for myself, but
>>>>> also to  others. The fact that github makes code browsing a much nicer
>>>>> experience, perhaps more people will take a look at what's under the
>>>>> hood. But then the code should also be accessible and clean. Are there
>>>>> any reasons not to pursue this path, and focus on more important
>>>>> problems of the stats library?
>>> Not sure that argsreduce is the best place to start (see Josef's reply),
>>> but there should be things that can be done to make the code easier to read.
>>> For example, this code is used in ~10 methods of rv_continuous:
>>>          loc,scale=map(kwds.get,['loc','scale'])
>>>          args, loc, scale = self._fix_loc_scale(args, loc, scale)
>>>          x,loc,scale = map(asarray,(x,loc,scale))
>>>          args = tuple(map(asarray,args))
>>> Some refactoring may be in order. The same is true of the rest of the
>>> implementation of many of those methods. Some are exactly the same except
>>> for calls to the corresponding underscored method (example: logsf() and
>>> logcdf() are identical except for calls to _logsf() and _logcdf(), and one
>>> nonsensical multiplication).
>>> Ralf
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>> I would say that the most important improvement needed in distributions is
>>> in the documentation.
>>> A new user would look at the doc string of, say, scipy.stats.norm, and
>>> have no idea how to proceed.  Here's the current example from the docstring
>>> of scipy.stats.norm:
>>> Examples
>>> --------
>>>>>> from scipy.stats import norm
>>>>>> numargs = norm.numargs
>>>>>> [  ] = [0.9,] * numargs
>>>>>> rv = norm()
>>>>>> x = np.linspace(0, np.minimum(rv.dist.b, 3))
>>>>>> h = plt.plot(x, rv.pdf(x))
>>> I don't even know what that means... and it doesn't compile.  Also, what
>>> is b?  how would I enter mu and sigma to make a normal distribution?  It's
>>> all pretty opaque.
>> True, the examples are confusing. The reason is that they're generated from
>> a template, and it's pretty much impossible to get clear and concise
>> examples that way. It would be better to write custom examples for the
>> most-used distributions, and refer to those from the others.
>> Ralf
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev

More information about the SciPy-Dev mailing list