[SciPy-User] R vs Python for simple interactive data analysis

Skipper Seabold jsseabold@gmail....
Mon Aug 29 10:10:17 CDT 2011

On Mon, Aug 29, 2011 at 10:57 AM, Christopher Jordan-Squire
<cjordan1@uw.edu> wrote:
> On Sun, Aug 28, 2011 at 2:54 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
>> On Sat, Aug 27, 2011 at 10:15 PM, Bruce Southey <bsouthey@gmail.com> wrote:
>>> On Sat, Aug 27, 2011 at 5:06 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>> On Sat, Aug 27, 2011 at 5:03 PM, Jason Grout
>>>> <jason-sage@creativetrax.com> wrote:
>>>>> On 8/27/11 1:19 PM, Christopher Jordan-Squire wrote:
>>>>>> This comparison might be useful to some people, so I stuck it up on a
>>>>>> github repo. My overall impression is that R is much stronger for
>>>>>> interactive data analysis. Click on the link for more details why,
>>>>>> which are summarized in the README file.
>>>>>  From the README:
>>>>> "In fact, using Python without the IPython qtconsole is practically
>>>>> impossible for this sort of cut and paste, interactive analysis.
>>>>> The shell IPython doesn't allow it because it automatically adds
>>>>> whitespace on multiline bits of code, breaking pre-formatted code's
>>>>> alignment. Cutting and pasting works for the standard python shell,
>>>>> but then you lose all the advantages of IPython."
>>>>> You might use %cpaste in the ipython normal shell to paste without it
>>>>> automatically inserting spaces:
>>>>> In [5]: %cpaste
>>>>> Pasting code; enter '--' alone on the line to stop.
>>>>> :if 1>0:
>>>>> :    print 'hi'
>>>>> :--
>>>>> hi
>>>>> Thanks,
>>>>> Jason
>>>> This strikes me as a textbook example of why we need an integrated
>>>> formula framework in statsmodels. I'll make a pass through when I get
>>>> a chance and see if there are some places where pandas would really
>>>> help out.
>>> We used to have a formula class in scipy.stats, and I do not follow
>>> nipy (http://nipy.sourceforge.net/nipy/stable/index.html), which also
>>> had this (extremely flexible but very hard to comprehend). It was what
>>> I had argued was needed ages ago for statsmodels. But it needs a
>>> community effort because the syntax required serves multiple
>>> communities with different annotations and needs. That is also seen
>>> from the different approaches taken by the stats packages from S/R,
>>> SAS, Genstat (and those are just the ones I have used).
>> We have held this discussion at _great_ length multiple times on the
>> statsmodels list and are in the process of trying to integrate
>> Charlton (from Nathaniel) and/or Formula (from Jonathan / NiPy) into
>> the statsmodels base.
>> http://statsmodels.sourceforge.net/dev/roadmap_todo.html#formula-framework
>> and more recently
>> https://groups.google.com/group/pystatsmodels/browse_thread/thread/a76ea5de9e96964b/fd85b80ae46c4931?
>> https://github.com/statsmodels/formula
>> https://github.com/statsmodels/charlton
>> Wes and I made some effort to go through this at SciPy. From where I
>> sit, I think it's difficult to disentangle the data structures from
>> the formula implementation, or maybe I'd just prefer to finish
>> tackling the former because it's much more straightforward. So I'd
>> like to first finish the pandas-integration branch that we've started
>> and then focus on the formula support. This is on my (our, I hope...)
>> immediate long-term goal list. Then I'd like to come back to the
>> community and hash out the 'rules of the game' details for formulas
>> after we have some code for people to play with, which promises to be
>> "fun."
>> https://github.com/statsmodels/statsmodels/tree/pandas-integration
>> FWIW, I could also improve the categorical function to be much nicer
>> for the given examples (i.e., take a list, drop a reference category),
>> but I don't know that it's worth it, because it's really just a
>> stop-gap and ideally users shouldn't have to rely on it. Thoughts on
>> more stop-gap?
> I want more usability, but I agree that a stop-gap probably isn't the
> right way to go, unless it has things we'd eventually want anyways.
>> If I understand Chris' concerns, I think pandas + formula will go a
>> long way towards bridging the gap between Python and R usability, but
> Yes, I agree. pandas + formulas would go a long, long way towards more
> usability.
> Though I really, really want a scatterplot smoother (i.e., lowess) in
> statsmodels. I use it a lot, and the final part of my R file was
> entirely lowess. (And, I should add, that was the part people liked
> best since one of the main goals of the assignment was to generate
> nifty pictures that could be used to summarize the data.)

Working my way through the pull requests. Very time poor...
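
For reference, the core of a scatterplot smoother like lowess is small.
The following is a minimal sketch in plain NumPy, not the statsmodels
implementation: the name lowess_sketch and the frac parameter are
assumptions made for illustration, and the robustness iterations of
the full algorithm are left out.

```python
import numpy as np


def lowess_sketch(x, y, frac=0.5):
    """Locally weighted linear fit at each x (hypothetical helper).

    frac is the fraction of points used in each local window.
    No robustness iterations are performed.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))  # points per local window
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]  # distance to the k-th nearest point
        if h == 0:
            # Degenerate window (ties in x): weight only the tied points.
            w = (dist == 0).astype(float)
        else:
            # Tricube weights, zero outside the window.
            w = np.clip(1.0 - (dist / h) ** 3, 0.0, None) ** 3
        # Weighted least squares for a line centered at x[i];
        # the intercept is then the fitted value at x[i].
        sw = np.sqrt(w)
        A = np.column_stack([np.ones(n), x - x[i]]) * sw[:, None]
        beta = np.linalg.lstsq(A, y * sw, rcond=None)[0]
        fitted[i] = beta[0]
    return fitted


# Noisy sine curve as toy data.
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 2 * np.pi, 50)
y_demo = np.sin(x_demo) + rng.normal(scale=0.2, size=x_demo.size)
smooth = lowess_sketch(x_demo, y_demo, frac=0.3)
```

Plotting smooth against x_demo on top of the raw points gives the kind
of summary picture described above.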

>> it's a large effort and there are only a handful (at best) of people
>> writing code -- Wes being the only one who's more or less "full time"
>> as far as I can tell. The 0.4 statsmodels release should be very
>> exciting though, I hope. I'm looking forward to it, at least. Then
>> there's only the small problem of building an infrastructure and
>> community like CRAN so we can have specialists writing and maintaining
>> code...but I hope once all the tools are in place this will seem much
>> less daunting. There certainly seems to be the right sentiment for it.
> At the very least creating and testing models would be much simpler.
> For weeks I've been wanting to see if gmm is the same as gee by
> fitting both models to the same dataset, but I've been putting it off
> because I didn't want to construct the design matrices by hand for
> such a simple question. (GMM--Generalized Method of Moments--is a
> standard econometrics model and GEE--Generalized Estimating
> Equations--is a standard biostatistics model. They're both
> generalizations of quasi-likelihood and appear very similar, but I
> want to fit some models to figure out if they're exactly the same.)

Oh, it's not *that* bad. I agree, of course, that it could be better,
but I've been using mainly Python for my work, including GMM and
estimating equations models (mainly empirical likelihood and
generalized maximum entropy) for the last ~two years.
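
To make "constructing the design matrices by hand" concrete, here is a
minimal sketch in plain NumPy of roughly what a formula framework would
automate: an intercept, dummy columns for a categorical variable with
one reference category dropped, and a continuous covariate. The helper
name design_matrix, its arguments, and the column naming are all
assumptions for illustration, not an existing statsmodels or pandas API.

```python
import numpy as np


def design_matrix(groups, covariate, reference=None):
    """Build [intercept | dummies | covariate] by hand (hypothetical helper)."""
    groups = np.asarray(groups)
    levels = sorted(set(groups.tolist()))
    if reference is None:
        reference = levels[0]  # drop the first level by default
    columns = [np.ones(len(groups))]
    names = ["Intercept"]
    for level in levels:
        if level == reference:
            continue  # the reference category is absorbed by the intercept
        columns.append((groups == level).astype(float))
        names.append("group[%s]" % level)
    columns.append(np.asarray(covariate, dtype=float))
    names.append("x")
    return np.column_stack(columns), names


groups = ["a", "b", "c", "a", "b", "c"]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X, names = design_matrix(groups, x)
print(names)
print(X)
```

Doing this once is not hard; repeating it for every model and every
choice of reference category is the tedium that formulas would remove.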

