[SciPy-Dev] distributions.py
nicky van foreest
vanforeest@gmail....
Sun Sep 16 13:17:33 CDT 2012
> One comment: I like your idea of splitting the generic code to a
> separate file. But I'd hesitate to create a separate file for each
> distribution: that's a lot of files. In my opinion, a good compromise
> would be to create one file for continuous distributions, and one for
> discrete. All of this could be in a new "scipy.stats.distributions"
> submodule, for the sake of code organization.
I just responded to Josef. This proposal makes the most sense I guess.
>
> Also, I'd add one more item to your list: make sure all code is PEP8
> compliant. Sometimes the PEP8 guidelines can seem a bit cumbersome, but
> they do make browsing and understanding code much easier.
I'll check the pep8 documentation.
I guess that improving the documentation is most important for the
moment. Once this is done, we can go on with splitting
distributions.py into two or three smaller files.
Nicky
>
> Thanks again for all your work on this - it's a very valuable contribution.
> Jake
>
> On 09/15/2012 02:03 PM, nicky van foreest wrote:
>> Hi,
>>
>> While reading distributions.py I made a kind of private trac list, of
>> stuff that might need refactoring, As a matter of fact, all issues
>> discussed in the mails above are already on my list. To summarize
>> (Please don't take the list below as a complaint, but just factual. I
>> am very happy that all this exists.)
>>
>> 1: the documentation is not clear, too concise, and fragmented;
>> actually a bit messy.
>>
>> 2: there is code overlap in the check work (The lines Ralf mentioned)
>> making it hard to find out the differences (but the differences in the
>> check work are method dependent so I don't quite know how to tackle
>> that in an elegant way),
>>
>> 3: the docs say that _argscheck need to be rewritten in case users
>> build their own distribution. But then the minimal requirement in my
>> opinion is that argscheck is simple to understand, and not overly
>> generic as it is right now. (I also have examples that its output,
>> while in line with its doc string, results in errors.) As far as I can
>> see its core can simply be replaced by np.all(cond) (I did not test
>> this though).
>>
>> 4: distributions.py is very big, too big for me actually. I recall
>> that my first attempt at finding out how the stats stuff worked was to
>> see how expon was implemented. No clue that this resided in
>> distributions.py.
>>
>> What I would like to see, although that would require a considerable
>> amount of work, is an architecture like this.
>> 1 rv_generic.py containing generic stuff
>> 2) rv_continous.py and rv_discrete.py, each imports rv_generic.
>> 3) each distribution is covered in a separate file. like expon.py,
>> norm, py, etc, and imports rv_continuous.py or rv_discrete.py,
>> whatever appropriate. Each docstring can/should contain some generic
>> part (like now) and a specific part, with working examples, and clear
>> explanations. The most important are normal, expon, binom, geom,
>> poisson, and perhaps some others. This would also enable others to
>> help extend the documentation, examples....
>> 4) I would like to move the math parts in continuous.rst to the doc
>> string in the related distribution file. Since mathjax gives such
>> nice results on screen, there is also no reason not to include the
>> mathematical facts in the doc string of the distribution itself. In
>> fact, most (all?) distributions already have a short math description,
>> but this is in overlap with continuous.rst.
>>
>> I wouldn't mind chopping up distributions.py into the separate
>> distributions, and merge it with the maths of continuous.rst. I can
>> tackle approx one distribution per day roughly, hence reduce this
>> mind-numbing work to roughly 15 minutes a day (correction work on
>> exams is much worse :-) ). But I don't know how much this proposal
>> will affect the automatic generation of documentation. For the rest I
>> don't think this will affect the code a lot.
>>
>>
>>
>> NIcky
>>
>>
>>
>>
>>
>> On 15 September 2012 11:59, Ralf Gommers <ralf.gommers@gmail.com> wrote:
>>>
>>> On Fri, Sep 14, 2012 at 10:56 PM, Jake Vanderplas
>>> <vanderplas@astro.washington.edu> wrote:
>>>> On 09/14/2012 01:49 PM, Ralf Gommers wrote:
>>>>
>>>>
>>>>
>>>> On Fri, Sep 14, 2012 at 12:48 AM, <josef.pktd@gmail.com> wrote:
>>>>> On Thu, Sep 13, 2012 at 5:21 PM, nicky van foreest <vanforeest@gmail.com>
>>>>> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Now that I understand github (Thanks to Ralf for his explanations in
>>>>>> Dutch) and got some simple stuff out of the way in distributions.py I
>>>>>> would like to tackle a somewhat harder issue. The function argsreduce
>>>>>> is, as far as I can see, too generic. I did some tests to see whether
>>>>>> its most generic output, as described by its docstring, is actually
>>>>>> swallowed by the callers of argsreduce, but this appears not to be the
>>>>>> case.
>>>>> being generic is not a disadvantage (per se) if it's fast
>>>>>
>>>>> https://github.com/scipy/scipy/commit/4abdc10487d453b56f761598e8e013816b01a665
>>>>> (and a being a one liner is not a disadvantage either)
>>>>>
>>>>> Josef
>>>>>
>>>>>> My motivation to simplify the code in distributions.py (and clean it
>>>>>> up) is partly based on making it simpler to understand for myself, but
>>>>>> also to others. The fact that github makes code browsing a much nicer
>>>>>> experience, perhaps more people will take a look at what's under the
>>>>>> hood. But then the code should also be accessible and clean. Are there
>>>>>> any reasons not to pursue this path, and focus on more important
>>>>>> problems of the stats library?
>>>>
>>>> Not sure that argsreduce is the best place to start (see Josef's reply),
>>>> but there should be things that can be done to make the code easier to read.
>>>> For example, this code is used in ~10 methods of rv_continuous:
>>>>
>>>> loc,scale=map(kwds.get,['loc','scale'])
>>>> args, loc, scale = self._fix_loc_scale(args, loc, scale)
>>>> x,loc,scale = map(asarray,(x,loc,scale))
>>>> args = tuple(map(asarray,args))
>>>>
>>>> Some refactoring may be in order. The same is true of the rest of the
>>>> implementation of many of those methods. Some are exactly the same except
>>>> for calls to the corresponding underscored method (example: logsf() and
>>>> logcdf() are identical except for calls to _logsf() and _logcdf(), and one
>>>> nonsensical multiplication).
>>>>
>>>> Ralf
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> SciPy-Dev mailing list
>>>> SciPy-Dev@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>>
>>>> I would say that the most important improvement needed in distributions is
>>>> in the documentation.
>>>>
>>>> A new user would look at the doc string of, say, scipy.stats.norm, and
>>>> have no idea how to proceed. Here's the current example from the docstring
>>>> of scipy.stats.norm:
>>>>
>>>> Examples
>>>> --------
>>>>>>> from scipy.stats import norm
>>>>>>> numargs = norm.numargs
>>>>>>> [ ] = [0.9,] * numargs
>>>>>>> rv = norm()
>>>>>>> x = np.linspace(0, np.minimum(rv.dist.b, 3))
>>>>>>> h = plt.plot(x, rv.pdf(x))
>>>> I don't even know what that means... and it doesn't compile. Also, what
>>>> is b? how would I enter mu and sigma to make a normal distribution? It's
>>>> all pretty opaque.
>>>
>>> True, the examples are confusing. The reason is that they're generated from
>>> a template, and it's pretty much impossible to get clear and concise
>>> examples that way. It would be better to write custom examples for the
>>> most-used distributions, and refer to those from the others.
>>>
>>> Ralf
>>>
>>>
>>>
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
More information about the SciPy-Dev
mailing list