[SciPy-Dev] distributions.py

nicky van foreest vanforeest@gmail....
Sun Sep 16 14:33:29 CDT 2012


On 16 September 2012 20:34, Warren Weckesser
<warren.weckesser@enthought.com> wrote:
>
>
> On Sun, Sep 16, 2012 at 1:17 PM, nicky van foreest <vanforeest@gmail.com>
> wrote:
>>
>> > One comment: I like your idea of splitting the generic code to a
>> > separate file.  But I'd hesitate to create a separate file for each
>> > distribution: that's a lot of files.  In my opinion, a good compromise
>> > would be to create one file for continuous distributions, and one for
>> > discrete.  All of this could be in a new "scipy.stats.distributions"
>> > submodule, for the sake of code organization.
>>
>> I just responded to Josef. This proposal makes the most sense I guess.
>>
>> >
>> > Also, I'd add one more item to your list: make sure all code is PEP8
>> > compliant.  Sometimes the PEP8 guidelines can seem a bit cumbersome, but
>> > they do make browsing and understanding code much easier.
>>
>> I'll check the pep8 documentation.
>>
>> I guess that improving the documentation is most important for the
>> moment. Once this is done, we can go on with splitting
>> distributions.py into two or three smaller files.
>>
>
>
> FWIW, I'm strongly in favor of the following:
> * Split distributions.py into three pieces (generic, discrete, continuous).

I'll put that in my local to do list.

> * Fix the screwy docstrings of the distributions (see the example Jake
> showed in a previous email).

As said in the mail to Jake, I think it is best to update the
docstrings first. That confused me the most when I started using
scipy.stats the most.

> * PEP8.  (Use the pep8 program to check the code.  I just got 1884 "errors"
> when I ran 'pep8 --repeat distributions.py | wc -l'.)

I test the code I contribute with pep8.py.


Nicky
>
> Warren
>
>
>> Nicky
>>
>> >
>> > Thanks again for all your work on this - it's a very valuable
>> > contribution.
>> >     Jake
>> >
>> > On 09/15/2012 02:03 PM, nicky van foreest wrote:
>> >> Hi,
>> >>
>> >> While reading distributions.py I made a kind of private trac list, of
>> >> stuff that might need refactoring, As a matter of fact, all issues
>> >> discussed in the mails above are already on my list. To summarize
>> >> (Please don't take the list below as a complaint, but just factual. I
>> >> am very happy that all this exists.)
>> >>
>> >> 1: the documentation is not clear, too concise, and fragmented;
>> >> actually a bit messy.
>> >>
>> >> 2: there is code overlap in the check work (The lines Ralf mentioned)
>> >> making it hard to find out the differences (but the differences in the
>> >> check work are method dependent so I don't quite know how to tackle
>> >> that in an elegant way),
>> >>
>> >> 3: the docs say that _argscheck need to be rewritten in case users
>> >> build their own distribution. But then the minimal requirement in my
>> >> opinion is that argscheck is simple to understand, and not overly
>> >> generic as it is right now. (I also have examples that its output,
>> >> while in line with its doc string, results in errors.) As far as I can
>> >> see its core can simply be replaced by np.all(cond) (I did not test
>> >> this though).
>> >>
>> >> 4: distributions.py is very big, too big for me actually. I recall
>> >> that my first attempt at finding out how the stats stuff worked was to
>> >> see how expon was implemented. No clue that this resided in
>> >> distributions.py.
>> >>
>> >> What I would like to see, although that would require a considerable
>> >> amount of work, is an architecture like this.
>> >> 1 rv_generic.py containing generic stuff
>> >> 2) rv_continous.py and rv_discrete.py, each imports rv_generic.
>> >> 3) each distribution is covered in a separate file. like expon.py,
>> >> norm, py, etc, and imports rv_continuous.py or rv_discrete.py,
>> >> whatever appropriate. Each docstring can/should contain some generic
>> >> part (like now) and a specific part, with working examples, and clear
>> >> explanations. The most important are normal, expon, binom, geom,
>> >> poisson, and perhaps some others. This would also enable others to
>> >> help extend the documentation, examples....
>> >> 4) I would like to move the math parts in continuous.rst to the doc
>> >> string in the related distribution file.  Since mathjax gives such
>> >> nice results on screen, there is also no reason not to include the
>> >> mathematical facts in the doc string of the distribution itself. In
>> >> fact, most (all?) distributions already have a short math description,
>> >> but this is in overlap with continuous.rst.
>> >>
>> >> I wouldn't mind chopping up distributions.py into the separate
>> >> distributions, and merge it with the maths of continuous.rst. I can
>> >> tackle approx one distribution per day roughly, hence reduce this
>> >> mind-numbing work to roughly 15 minutes a day (correction work on
>> >> exams is much worse :-) ). But I don't know how much this proposal
>> >> will affect the automatic generation of documentation. For the rest I
>> >> don't think this will affect the code a lot.
>> >>
>> >>
>> >>
>> >> NIcky
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On 15 September 2012 11:59, Ralf Gommers <ralf.gommers@gmail.com>
>> >> wrote:
>> >>>
>> >>> On Fri, Sep 14, 2012 at 10:56 PM, Jake Vanderplas
>> >>> <vanderplas@astro.washington.edu> wrote:
>> >>>> On 09/14/2012 01:49 PM, Ralf Gommers wrote:
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Fri, Sep 14, 2012 at 12:48 AM, <josef.pktd@gmail.com> wrote:
>> >>>>> On Thu, Sep 13, 2012 at 5:21 PM, nicky van foreest
>> >>>>> <vanforeest@gmail.com>
>> >>>>> wrote:
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> Now that I understand github (Thanks to Ralf for his explanations
>> >>>>>> in
>> >>>>>> Dutch) and got some simple stuff out of the way in distributions.py
>> >>>>>> I
>> >>>>>> would like to tackle a somewhat harder issue. The function
>> >>>>>> argsreduce
>> >>>>>> is, as far as I can see, too generic. I did some tests to see
>> >>>>>> whether
>> >>>>>> its most generic output, as described by its docstring, is actually
>> >>>>>> swallowed by the callers of argsreduce, but this appears not to be
>> >>>>>> the
>> >>>>>> case.
>> >>>>> being generic is not a disadvantage (per se) if it's fast
>> >>>>>
>> >>>>>
>> >>>>> https://github.com/scipy/scipy/commit/4abdc10487d453b56f761598e8e013816b01a665
>> >>>>> (and a being a one liner is not a disadvantage either)
>> >>>>>
>> >>>>> Josef
>> >>>>>
>> >>>>>> My motivation to simplify the code in distributions.py (and clean
>> >>>>>> it
>> >>>>>> up) is partly based on making it simpler to understand for myself,
>> >>>>>> but
>> >>>>>> also to  others. The fact that github makes code browsing a much
>> >>>>>> nicer
>> >>>>>> experience, perhaps more people will take a look at what's under
>> >>>>>> the
>> >>>>>> hood. But then the code should also be accessible and clean. Are
>> >>>>>> there
>> >>>>>> any reasons not to pursue this path, and focus on more important
>> >>>>>> problems of the stats library?
>> >>>>
>> >>>> Not sure that argsreduce is the best place to start (see Josef's
>> >>>> reply),
>> >>>> but there should be things that can be done to make the code easier
>> >>>> to read.
>> >>>> For example, this code is used in ~10 methods of rv_continuous:
>> >>>>
>> >>>>          loc,scale=map(kwds.get,['loc','scale'])
>> >>>>          args, loc, scale = self._fix_loc_scale(args, loc, scale)
>> >>>>          x,loc,scale = map(asarray,(x,loc,scale))
>> >>>>          args = tuple(map(asarray,args))
>> >>>>
>> >>>> Some refactoring may be in order. The same is true of the rest of the
>> >>>> implementation of many of those methods. Some are exactly the same
>> >>>> except
>> >>>> for calls to the corresponding underscored method (example: logsf()
>> >>>> and
>> >>>> logcdf() are identical except for calls to _logsf() and _logcdf(),
>> >>>> and one
>> >>>> nonsensical multiplication).
>> >>>>
>> >>>> Ralf
>> >>>>
>> >>>>
>> >>>>
>> >>>> _______________________________________________
>> >>>> SciPy-Dev mailing list
>> >>>> SciPy-Dev@scipy.org
>> >>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>> >>>>
>> >>>> I would say that the most important improvement needed in
>> >>>> distributions is
>> >>>> in the documentation.
>> >>>>
>> >>>> A new user would look at the doc string of, say, scipy.stats.norm,
>> >>>> and
>> >>>> have no idea how to proceed.  Here's the current example from the
>> >>>> docstring
>> >>>> of scipy.stats.norm:
>> >>>>
>> >>>> Examples
>> >>>> --------
>> >>>>>>> from scipy.stats import norm
>> >>>>>>> numargs = norm.numargs
>> >>>>>>> [  ] = [0.9,] * numargs
>> >>>>>>> rv = norm()
>> >>>>>>> x = np.linspace(0, np.minimum(rv.dist.b, 3))
>> >>>>>>> h = plt.plot(x, rv.pdf(x))
>> >>>> I don't even know what that means... and it doesn't compile.  Also,
>> >>>> what
>> >>>> is b?  how would I enter mu and sigma to make a normal distribution?
>> >>>> It's
>> >>>> all pretty opaque.
>> >>>
>> >>> True, the examples are confusing. The reason is that they're generated
>> >>> from
>> >>> a template, and it's pretty much impossible to get clear and concise
>> >>> examples that way. It would be better to write custom examples for the
>> >>> most-used distributions, and refer to those from the others.
>> >>>
>> >>> Ralf
>> >>>
>> >>>
>> >>>
>> >>> _______________________________________________
>> >>> SciPy-Dev mailing list
>> >>> SciPy-Dev@scipy.org
>> >>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>> >>>
>> >> _______________________________________________
>> >> SciPy-Dev mailing list
>> >> SciPy-Dev@scipy.org
>> >> http://mail.scipy.org/mailman/listinfo/scipy-dev
>> >
>> >
>> > _______________________________________________
>> > SciPy-Dev mailing list
>> > SciPy-Dev@scipy.org
>> > http://mail.scipy.org/mailman/listinfo/scipy-dev
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>


More information about the SciPy-Dev mailing list