[SciPy-User] R vs Python for simple interactive data analysis

josef.pktd@gmai... josef.pktd@gmai...
Mon Aug 29 16:51:02 CDT 2011

On Mon, Aug 29, 2011 at 5:03 PM, Christopher Jordan-Squire
<cjordan1@uw.edu> wrote:
> On Mon, Aug 29, 2011 at 12:13 PM,  <josef.pktd@gmail.com> wrote:
>> On Mon, Aug 29, 2011 at 12:59 PM,  <josef.pktd@gmail.com> wrote:
>>> On Mon, Aug 29, 2011 at 11:42 AM,  <josef.pktd@gmail.com> wrote:
>>>> On Mon, Aug 29, 2011 at 11:34 AM, Christopher Jordan-Squire
>>>> <cjordan1@uw.edu> wrote:
>>>>> On Mon, Aug 29, 2011 at 10:27 AM,  <josef.pktd@gmail.com> wrote:
>>>>>> On Mon, Aug 29, 2011 at 11:10 AM, Skipper Seabold <jsseabold@gmail.com> wrote:
>>>>>>> On Mon, Aug 29, 2011 at 10:57 AM, Christopher Jordan-Squire
>>>>>>> <cjordan1@uw.edu> wrote:
>>>>>>>> On Sun, Aug 28, 2011 at 2:54 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
>>>>>>>>> On Sat, Aug 27, 2011 at 10:15 PM, Bruce Southey <bsouthey@gmail.com> wrote:
>>>>>>>>>> On Sat, Aug 27, 2011 at 5:06 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>>>>>>>>> On Sat, Aug 27, 2011 at 5:03 PM, Jason Grout
>>>>>>>>>>> <jason-sage@creativetrax.com> wrote:
>>>>>>>>>>>> On 8/27/11 1:19 PM, Christopher Jordan-Squire wrote:
>>>>>>>>>>>>> This comparison might be useful to some people, so I stuck it up on a
>>>>>>>>>>>>> github repo. My overall impression is that R is much stronger for
>>>>>>>>>>>>> interactive data analysis. Click on the link for more details why,
>>>>>>>>>>>>> which are summarized in the README file.
>>>>>>>>>>>>  From the README:
>>>>>>>>>>>> "In fact, using Python without the IPython qtconsole is practically
>>>>>>>>>>>> impossible for this sort of cut and paste, interactive analysis.
>>>>>>>>>>>> The shell IPython doesn't allow it because it automatically adds
>>>>>>>>>>>> whitespace on multiline bits of code, breaking pre-formatted code's
>>>>>>>>>>>> alignment. Cutting and pasting works for the standard python shell,
>>>>>>>>>>>> but then you lose all the advantages of IPython."
>>>>>>>>>>>> You might use %cpaste in the ipython normal shell to paste without it
>>>>>>>>>>>> automatically inserting spaces:
>>>>>>>>>>>> In [5]: %cpaste
>>>>>>>>>>>> Pasting code; enter '--' alone on the line to stop.
>>>>>>>>>>>> :if 1>0:
>>>>>>>>>>>> :    print 'hi'
>>>>>>>>>>>> :--
>>>>>>>>>>>> hi
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Jason
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> SciPy-User mailing list
>>>>>>>>>>>> SciPy-User@scipy.org
>>>>>>>>>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>>>>>>>>> This strikes me as a textbook example of why we need an integrated
>>>>>>>>>>> formula framework in statsmodels. I'll make a pass through when I get
>>>>>>>>>>> a chance and see if there are some places where pandas would really
>>>>>>>>>>> help out.
>>>>>>>>>> We used to have a formula class in scipy.stats and I do not follow
>>>>>>>>>> nipy (http://nipy.sourceforge.net/nipy/stable/index.html) as it also
>>>>>>>>>> had this (extremely flexible but very hard to comprehend). It was what
>>>>>>>>>> I had argued was needed ages ago for statsmodels. But it needs a
>>>>>>>>>> community effort because the syntax required serves multiple
>>>>>>>>>> communities with different annotations and needs. That is also seen
>>>>>>>>>> from the different approaches taken by the stats packages from S/R,
>>>>>>>>>> SAS, Genstat (and those are just the ones I have used).
>>>>>>>>> We have held this discussion at _great_ length multiple times on the
>>>>>>>>> statsmodels list and are in the process of trying to integrate
>>>>>>>>> Charlton (from Nathaniel) and/or Formula (from Jonathan / NiPy) into
>>>>>>>>> the statsmodels base.
>>>>>>>>> http://statsmodels.sourceforge.net/dev/roadmap_todo.html#formula-framework
>>>>>>>>> and more recently
>>>>>>>>> https://groups.google.com/group/pystatsmodels/browse_thread/thread/a76ea5de9e96964b/fd85b80ae46c4931?
>>>>>>>>> https://github.com/statsmodels/formula
>>>>>>>>> https://github.com/statsmodels/charlton
>>>>>>>>> Wes and I made some effort to go through this at SciPy. From where I
>>>>>>>>> sit, I think it's difficult to disentangle the data structures from
>>>>>>>>> the formula implementation, or maybe I'd just prefer to finish
>>>>>>>>> tackling the former because it's much more straightforward. So I'd
>>>>>>>>> like to first finish the pandas-integration branch that we've started
>>>>>>>>> and then focus on the formula support. This is on my (our, I hope...)
>>>>>>>>> immediate long-term goal list. Then I'd like to come back to the
>>>>>>>>> community and hash out the 'rules of the game' details for formulas
>>>>>>>>> after we have some code for people to play with, which promises to be
>>>>>>>>> "fun."
>>>>>>>>> https://github.com/statsmodels/statsmodels/tree/pandas-integration
>>>>>>>>> FWIW, I could also improve the categorical function to be much nicer
>>>>>>>>> for the given examples (ie., take a list, drop a reference category),
>>>>>>>>> but I don't know that it's worth it, because it's really just a
>>>>>>>>> stop-gap and ideally users shouldn't have to rely on it. Thoughts on
>>>>>>>>> more stop-gap?
>>>>>>>> I want more usability, but I agree that a stop-gap probably isn't the
>>>>>>>> right way to go, unless it has things we'd eventually want anyways.
>>>>>>>>> If I understand Chris' concerns, I think pandas + formula will go a
>>>>>>>>> long way towards bridging the gap between Python and R usability, but
>>>>>>>> Yes, I agree. pandas + formulas would go a long, long way towards more
>>>>>>>> usability.
>>>>>>>> Though I really, really want a scatterplot smoother (i.e., lowess) in
>>>>>>>> statsmodels. I use it a lot, and the final part of my R file was
>>>>>>>> entirely lowess. (And, I should add, that was the part people liked
>>>>>>>> best since one of the main goals of the assignment was to generate
>>>>>>>> nifty pictures that could be used to summarize the data.)
>>>>>>> Working my way through the pull requests. Very time poor...
>>>>>>>>> it's a large effort and there are only a handful (at best) of people
>>>>>>>>> writing code -- Wes being the only one who's more or less "full time"
>>>>>>>>> as far as I can tell. The 0.4 statsmodels release should be very
>>>>>>>>> exciting though, I hope. I'm looking forward to it, at least. Then
>>>>>>>>> there's only the small problem of building an infrastructure and
>>>>>>>>> community like CRAN so we can have specialists writing and maintaining
>>>>>>>>> code...but I hope once all the tools are in place this will seem much
>>>>>>>>> less daunting. There certainly seems to be the right sentiment for it.
>>>>>>>> At the very least creating and testing models would be much simpler.
>>>>>>>> For weeks I've been wanting to see if gmm is the same as gee by
>>>>>>>> fitting both models to the same dataset, but I've been putting it off
>>>>>>>> because I didn't want to construct the design matrices by hand for
>>>>>>>> such a simple question. (GMM--Generalized Method of Moments--is a
>>>>>>>> standard econometrics model and GEE--Generalized Estimating
>>>>>>>> Equations--is a standard biostatistics model. They're both
>>>>>>>> generalizations of quasi-likelihood and appear very similar, but I
>>>>>>>> want to fit some models to figure out if they're exactly the same.)
>>>>>> Since GMM is still in the sandbox, the interface is not very polished,
>>>>>> and it's missing some enhancements. I recommend asking on the mailing
>>>>>> list if it's not clear.
>>>>>> Note GMM itself is very general and will never be a quick interactive
>>>>>> method. The main work will always be to define the moment conditions
>>>>>> (a bit similar to non-linear function estimation, optimize.leastsq).
>>>>>> There are and will be special subclasses, eg. IV2SLS, that have
>>>>>> predefined moment conditions, but, still, it's up to the user to
>>>>>> construct design and instrument arrays.
>>>>>> And as far as I remember, the GMM/GEE package in R doesn't have a
>>>>>> formula interface either.
>>>>> Both of the two gee packages in R I know of have formula interfaces.
>>>>> http://cran.r-project.org/web/packages/geepack/
>>>>> http://cran.r-project.org/web/packages/gee/index.html
>>> This is very different from what's in GMM in statsmodels so far. The
>>> help file is very short, so I'm mostly guessing.
>>> It seems to be for (a subset of) generalized linear models with
>>> longitudinal/panel covariance structures. Something like this will
>>> eventually be available (once we get panel data models) as a special case of GMM
>>> in statsmodels, assuming it's similar to what I know from the
>>> econometrics literature.
>>> Most of the subclasses of GMM that I currently have are focused on
>>> instrumental variable estimation, including non-linear regression.
>>> This should be expanded over time.
>>> But GMM itself is designed for subclassing by someone who wants to use
>>> her/his own moment conditions, as in
>>> http://cran.r-project.org/web/packages/gmm/index.html
>>> or for us to implement specific models with it.
>>> If someone wants to use it, then I have to quickly add the options for
>>> the kernels of the weighting matrix, which I keep postponing.
>>> Currently there is only a truncated, uniform kernel that assumes
>>> observations are ordered by time, but users can provide their own
>>> weighting function.
>>> Josef
>>>> I have to look at this. I mixed up some acronyms, I meant GEL and GMM
>>>> http://cran.r-project.org/web/packages/gmm/index.html
>>>> the vignette was one of my readings, and the STATA description for GMM.
>>>> I never really looked at GEE. (That's Skipper's private work so far.)
>>>> Josef
>>>>> -Chris JS
>>>>>> Josef
>>>>>>> Oh, it's not *that* bad. I agree, of course, that it could be better,
>>>>>>> but I've been using mainly Python for my work, including GMM and
>>>>>>> estimating equations models (mainly empirical likelihood and
>>>>>>> generalized maximum entropy) for the last ~two years.
>>>>>>> Skipper
>> just to make another point:
>> Without someone adding mixed effects, hierarchical, panel/longitudinal
>> models, and .... it will not help to have a formula interface to them.
>> (Thanks to Scott we will soon have survival)
> I don't think I understand.
> I assumed that the formula framework is essentially orthogonal to the
> models themselves. In the sense that it should be simple to adapt a
> formula framework to new models. At least if they're some variety of
> linear model, and provided the formula framework is designed to allow
> for grouping syntax from the beginning. I think ease of extension to
> new models is a major goal, in fact, since we want it to be easy for
> people to contribute new models.

We still need to program the linear algebra to find the estimator, and
we need to define and calculate all the result statistics for the
different models.
(generic GLS won't work well because of the nobs*nobs covariance
matrix, I tried a little bit in the sandbox.)
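
A minimal sketch of the nobs*nobs problem, not the sandbox code: if the
covariance has the random effects form V = sigma2*I + Z*G*Z' (an
assumption made here for illustration, with G and sigma2 taken as
known), the Woodbury identity gives the GLS estimate from q*q systems
only, where q is the number of columns of Z. The names gls_dense and
gls_woodbury are hypothetical.

```python
import numpy as np


def gls_dense(y, X, V):
    """Textbook GLS: needs the full nobs x nobs covariance matrix V."""
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    return np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)


def gls_woodbury(y, X, Z, G, sigma2):
    """GLS for V = sigma2*I + Z G Z' without ever forming V."""
    # Woodbury: V^{-1} A = (A - Z (sigma2*G^{-1} + Z'Z)^{-1} Z'A) / sigma2
    inner = sigma2 * np.linalg.inv(G) + Z.T @ Z  # only q x q

    def vinv(A):
        return (A - Z @ np.linalg.solve(inner, Z.T @ A)) / sigma2

    return np.linalg.solve(X.T @ vinv(X), X.T @ vinv(y))


# Simulated data (illustrative only): random intercept for each group.
rng = np.random.default_rng(0)
n_groups, n_per = 20, 5
nobs = n_groups * n_per
X = np.column_stack([np.ones(nobs), rng.normal(size=nobs)])
Z = np.kron(np.eye(n_groups), np.ones((n_per, 1)))
G = 0.5 * np.eye(n_groups)
sigma2 = 1.0
y = (X @ np.array([1.0, 2.0])
     + Z @ rng.normal(scale=np.sqrt(0.5), size=n_groups)
     + rng.normal(size=nobs))

V = sigma2 * np.eye(nobs) + Z @ G @ Z.T  # the array we want to avoid
b_dense = gls_dense(y, X, V)
b_wood = gls_woodbury(y, X, Z, G, sigma2)
```

Both functions return the same estimate; only gls_dense has to store
and factor an nobs*nobs array.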

As an example:   mixed effects model with REML, ...

y = X*b + Z*g, with X fixed regressors/effects and Z random effects.
Assume the design matrices X and Z are already constructed.
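
For concreteness, a sketch of the criterion that REML minimizes,
written densely so it is only meant for small nobs. It assumes an error
term e with covariance sigma2*I in addition to Z*g, and cov(g) = G; the
name neg2_reml is hypothetical, and this is not the translated matlab
code.

```python
import numpy as np


def neg2_reml(y, X, Z, G, sigma2):
    """-2 * restricted log-likelihood for y = X b + Z g + e."""
    n, p = X.shape
    V = sigma2 * np.eye(n) + Z @ G @ Z.T      # marginal covariance of y
    Vinv_X = np.linalg.solve(V, X)
    XtVinvX = X.T @ Vinv_X
    b = np.linalg.solve(XtVinvX, Vinv_X.T @ y)  # GLS estimate of b
    resid = y - X @ b
    _, logdet_V = np.linalg.slogdet(V)
    _, logdet_XVX = np.linalg.slogdet(XtVinvX)
    quad = resid @ np.linalg.solve(V, resid)
    return logdet_V + logdet_XVX + quad + (n - p) * np.log(2 * np.pi)


# Simulated data (illustrative only): random intercept for each group.
rng = np.random.default_rng(0)
n_groups, n_per = 8, 4
nobs = n_groups * n_per
X = np.column_stack([np.ones(nobs), rng.normal(size=nobs)])
Z = np.kron(np.eye(n_groups), np.ones((n_per, 1)))
y = (X @ np.array([1.0, 2.0])
     + Z @ rng.normal(size=n_groups)
     + rng.normal(size=nobs))

value = neg2_reml(y, X, Z, G=np.eye(n_groups), sigma2=1.0)
```

Estimating the variance components then means minimizing this over the
parameters of G and sigma2, for example with a scipy.optimize routine.
That step and the result statistics are not shown.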

Since I don't know the statistics literature well (in contrast to
econometrics panel data), I started to translate a matlab version to
help me understand this.
But the results don't match up, and I haven't had access to matlab for
a while now.
And I now think that a literal translation of long matlab functions
doesn't really help, compared to writing from a good textbook and
checking some crucial steps.

It's only 250 lines of code, but dense, and I have spent quite some
time on it.
The standard solution of the normal equations still looks simple, but
that's just the beginning, and writing the tests often takes almost as
much time as writing the code.
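
The "simple" part can be sketched with Henderson's mixed model
equations, assuming the variance components are already known
(cov(g) = G and error covariance sigma2*I; both are assumptions for
illustration, and henderson_solve is a hypothetical name). Estimating G
and sigma2 and computing the result statistics is where the remaining
work is.

```python
import numpy as np


def henderson_solve(y, X, Z, G, sigma2):
    """Solve the mixed model equations for b (fixed) and g (random).

    [X'X   X'Z                ] [b]   [X'y]
    [Z'X   Z'Z + sigma2*G^{-1}] [g] = [Z'y]
    """
    p = X.shape[1]
    lhs = np.block([
        [X.T @ X, X.T @ Z],
        [Z.T @ X, Z.T @ Z + sigma2 * np.linalg.inv(G)],
    ])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:p], sol[p:]


# Simulated data (illustrative only): random intercept for each group.
rng = np.random.default_rng(1)
n_groups, n_per = 10, 6
nobs = n_groups * n_per
X = np.column_stack([np.ones(nobs), rng.normal(size=nobs)])
Z = np.kron(np.eye(n_groups), np.ones((n_per, 1)))
G = 0.8 * np.eye(n_groups)
sigma2 = 1.5
y = (X @ np.array([0.5, -1.0])
     + Z @ rng.normal(scale=np.sqrt(0.8), size=n_groups)
     + rng.normal(scale=np.sqrt(sigma2), size=nobs))

b_hat, g_hat = henderson_solve(y, X, Z, G, sigma2)
```

Here b_hat agrees with the GLS estimate under V = sigma2*I + Z*G*Z',
and g_hat with the usual predictor G*Z'*V^{-1}*(y - X*b_hat), which is
a convenient cross-check for tests.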

My experience with things I don't know well: It takes 2 weeks of
staring at it and playing with it, and then it ends up just as a few
lines (or a few hundred lines) of code.

The old mixed effects model with repeated measurements (EM algorithm)
based on the original formula code still sits in the sandbox. It
doesn't quite work, the formula code makes it difficult to
understand, and it would require a week or five to clean up, enhance,
test, ...
Since neither Skipper nor I are specifically interested (in the sense
of: It is not what we know and would use ourselves), it is still
waiting there.

The old survival code is also still sitting in the sandbox, but Scott
wrote a new version without formula. It looks like it will also soon be
ready for a pull request, or a review leading up to a pull request. (I find
Scott's version much easier to read because it uses basic python and
numpy data structures, instead of several layers of formula


> -Chris JS
>> Josef
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mixed.py
Type: text/x-python
Size: 15804 bytes
Desc: not available
Url : http://mail.scipy.org/pipermail/scipy-user/attachments/20110829/52c55808/attachment-0001.py 
