[SciPy-User] R vs Python for simple interactive data analysis

Christopher Jordan-Squire cjordan1@uw....
Mon Aug 29 10:21:35 CDT 2011


On Mon, Aug 29, 2011 at 10:10 AM, Skipper Seabold <jsseabold@gmail.com> wrote:
> On Mon, Aug 29, 2011 at 10:57 AM, Christopher Jordan-Squire
> <cjordan1@uw.edu> wrote:
>> On Sun, Aug 28, 2011 at 2:54 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
>>> On Sat, Aug 27, 2011 at 10:15 PM, Bruce Southey <bsouthey@gmail.com> wrote:
>>>> On Sat, Aug 27, 2011 at 5:06 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>>> On Sat, Aug 27, 2011 at 5:03 PM, Jason Grout
>>>>> <jason-sage@creativetrax.com> wrote:
>>>>>> On 8/27/11 1:19 PM, Christopher Jordan-Squire wrote:
>>>>>>> This comparison might be useful to some people, so I stuck it up on a
>>>>>>> github repo. My overall impression is that R is much stronger for
>>>>>>> interactive data analysis. Click on the link for more details why,
>>>>>>> which are summarized in the README file.
>>>>>>
>>>>>> From the README:
>>>>>>
>>>>>> "In fact, using Python without the IPython qtconsole is practically
>>>>>> impossible for this sort of cut and paste, interactive analysis.
>>>>>> The shell IPython doesn't allow it because it automatically adds
>>>>>> whitespace on multiline bits of code, breaking pre-formatted code's
>>>>>> alignment. Cutting and pasting works for the standard python shell,
>>>>>> but then you lose all the advantages of IPython."
>>>>>>
>>>>>>
>>>>>>
>>>>>> You might use %cpaste in the ipython normal shell to paste without it
>>>>>> automatically inserting spaces:
>>>>>>
>>>>>> In [5]: %cpaste
>>>>>> Pasting code; enter '--' alone on the line to stop.
>>>>>> :if 1>0:
>>>>>> :    print 'hi'
>>>>>> :--
>>>>>> hi
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jason
>>>>>>
>>>>>> _______________________________________________
>>>>>> SciPy-User mailing list
>>>>>> SciPy-User@scipy.org
>>>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>>>>
>>>>>
>>>>> This strikes me as a textbook example of why we need an integrated
>>>>> formula framework in statsmodels. I'll make a pass through when I get
>>>>> a chance and see if there are some places where pandas would really
>>>>> help out.
>>>>
>>>> We used to have a formula class in scipy.stats and I do not follow
>>>> nipy (http://nipy.sourceforge.net/nipy/stable/index.html) as it also
>>>> had this (extremely flexible but very hard to comprehend). It was what
>>>> I had argued was needed ages ago for statsmodels. But it needs a
>>>> community effort because the syntax required serves multiple
>>>> communities with different annotations and needs. That is also seen
>>>> from the different approaches taken by the stats packages from S/R,
>>>> SAS, Genstat (and those are just the ones I have used).
>>>>
>>>
>>> We have held this discussion at _great_ length multiple times on the
>>> statsmodels list and are in the process of trying to integrate
>>> Charlton (from Nathaniel) and/or Formula (from Jonathan / NiPy) into
>>> the statsmodels base.
>>>
>>> http://statsmodels.sourceforge.net/dev/roadmap_todo.html#formula-framework
>>>
>>> and more recently
>>>
>>> https://groups.google.com/group/pystatsmodels/browse_thread/thread/a76ea5de9e96964b/fd85b80ae46c4931?
>>>
>>> https://github.com/statsmodels/formula
>>> https://github.com/statsmodels/charlton
>>>
>>> Wes and I made some effort to go through this at SciPy. From where I
>>> sit, I think it's difficult to disentangle the data structures from
>>> the formula implementation, or maybe I'd just prefer to finish
>>> tackling the former because it's much more straightforward. So I'd
>>> like to first finish the pandas-integration branch that we've started
>>> and then focus on the formula support. This is on my (our, I hope...)
>>> immediate long-term goal list. Then I'd like to come back to the
>>> community and hash out the 'rules of the game' details for formulas
>>> after we have some code for people to play with, which promises to be
>>> "fun."
>>>
>>> https://github.com/statsmodels/statsmodels/tree/pandas-integration
>>>
>>> FWIW, I could also improve the categorical function to be much nicer
>>> for the given examples (i.e., take a list, drop a reference category),
>>> but I don't know that it's worth it, because it's really just a
>>> stop-gap and ideally users shouldn't have to rely on it. Thoughts on
>>> more stop-gap?
>>>
>>
>> I want more usability, but I agree that a stop-gap probably isn't the
>> right way to go, unless it has things we'd eventually want anyways.
>>
>>> If I understand Chris' concerns, I think pandas + formula will go a
>>> long way towards bridging the gap between Python and R usability, but
>>
>> Yes, I agree. pandas + formulas would go a long, long way towards more
>> usability.
>>
>> Though I really, really want a scatterplot smoother (i.e., lowess) in
>> statsmodels. I use it a lot, and the final part of my R file was
>> entirely lowess. (And, I should add, that was the part people liked
>> best since one of the main goals of the assignment was to generate
>> nifty pictures that could be used to summarize the data.)
>>
>
> Working my way through the pull requests. Very time poor...

:-) Thanks Skipper!

>
>>> it's a large effort and there are only a handful (at best) of people
>>> writing code -- Wes being the only one who's more or less "full time"
>>> as far as I can tell. The 0.4 statsmodels release should be very
>>> exciting though, I hope. I'm looking forward to it, at least. Then
>>> there's only the small problem of building an infrastructure and
>>> community like CRAN so we can have specialists writing and maintaining
>>> code...but I hope once all the tools are in place this will seem much
>>> less daunting. There certainly seems to be the right sentiment for it.
>>>
>>
>> At the very least creating and testing models would be much simpler.
>> For weeks I've been wanting to see if gmm is the same as gee by
>> fitting both models to the same dataset, but I've been putting it off
>> because I didn't want to construct the design matrices by hand for
>> such a simple question. (GMM--Generalized Method of Moments--is a
>> standard econometrics model and GEE--Generalized Estimating
>> Equations--is a standard biostatistics model. They're both
>> generalizations of quasi-likelihood and appear very similar, but I
>> want to fit some models to figure out if they're exactly the same.)
>>
>
> Oh, it's not *that* bad. I agree, of course, that it could be better,
> but I've been using mainly Python for my work, including GMM and
> estimating equations models (mainly empirical likelihood and
> generalized maximum entropy) for the last ~two years.
>

Yes, I didn't mean to imply it was unusable. Merely that it's kinda
time consuming and not fun to think about design matrices. I'm sure it
becomes easier if you keep doing it for a while.
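
For anyone wondering what "design matrices by hand" involves, here is a
minimal sketch in plain Python. The function name, the column names, and
the "first sorted level is the reference" rule are all assumptions made
for illustration, not anything in statsmodels:

```python
def design_matrix(rows, numeric, categorical):
    """Build an intercept + numeric + dummy-coded design matrix by hand.

    rows is a list of dicts; numeric and categorical are lists of column
    names (all hypothetical). For each categorical column the first
    sorted level is treated as the reference category and dropped.
    """
    names = ["Intercept"] + list(numeric)
    levels = {}
    for col in categorical:
        # Keep every level except the reference (first in sorted order).
        levels[col] = sorted({row[col] for row in rows})[1:]
        names += ["%s[%s]" % (col, lev) for lev in levels[col]]

    matrix = []
    for row in rows:
        line = [1.0] + [float(row[col]) for col in numeric]
        for col in categorical:
            # One 0/1 indicator column per non-reference level.
            line += [1.0 if row[col] == lev else 0.0 for lev in levels[col]]
        matrix.append(line)
    return names, matrix


# Made-up toy data, only to show the shape of the result.
data = [
    {"dose": 1.0, "group": "a"},
    {"dose": 2.0, "group": "b"},
    {"dose": 3.0, "group": "c"},
]
names, X = design_matrix(data, ["dose"], ["group"])
```

Every interaction or extra categorical variable means more bookkeeping
of this kind, which is the tedious part.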

My main point was that it would be simpler to put new models into
statsmodels with a formula framework, because it'd make testing easier:
you could add/remove terms and interactions from the model in
attempts to break the fitting procedure.
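
As a rough sketch of that kind of testing, the snippet below just
enumerates R-style formula strings over subsets of terms, with and
without pairwise interactions. The function name and the "a:b"
interaction syntax are assumptions for illustration, and the actual
fitting call is left out since it depends on whichever formula
framework ends up in statsmodels:

```python
from itertools import combinations


def candidate_formulas(response, terms):
    """Return R-style formula strings for every non-empty subset of terms.

    Each subset is emitted once with main effects only and, when it has
    at least two terms, once more with all pairwise interactions added
    (written a:b). Hypothetical helper, for illustration only.
    """
    formulas = []
    for k in range(1, len(terms) + 1):
        for subset in combinations(terms, k):
            main = " + ".join(subset)
            formulas.append("%s ~ %s" % (response, main))
            pairs = ["%s:%s" % pair for pair in combinations(subset, 2)]
            if pairs:
                formulas.append(
                    "%s ~ %s + %s" % (response, main, " + ".join(pairs))
                )
    return formulas


formulas = candidate_formulas("y", ["x1", "x2"])
for formula in formulas:
    print(formula)
```

Each string would then be handed to the model's fit routine inside a
loop, so a term combination that breaks the fit shows up right away.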

-Chris JS

> Skipper

