[SciPy-User] R vs Python for simple interactive data analysis

Bruce Southey bsouthey@gmail....
Sun Aug 28 20:16:08 CDT 2011


On Sun, Aug 28, 2011 at 2:54 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
> On Sat, Aug 27, 2011 at 10:15 PM, Bruce Southey <bsouthey@gmail.com> wrote:
>> On Sat, Aug 27, 2011 at 5:06 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>> On Sat, Aug 27, 2011 at 5:03 PM, Jason Grout
>>> <jason-sage@creativetrax.com> wrote:
>>>> On 8/27/11 1:19 PM, Christopher Jordan-Squire wrote:
>>>>> This comparison might be useful to some people, so I stuck it up on a
>>>>> github repo. My overall impression is that R is much stronger for
>>>>> interactive data analysis. Click on the link for more details why,
>>>>> which are summarized in the README file.
>>>>
>>>>  From the README:
>>>>
>>>> "In fact, using Python without the IPython qtconsole is practically
>>>> impossible for this sort of cut and paste, interactive analysis.
>>>> The shell IPython doesn't allow it because it automatically adds
>>>> whitespace on multiline bits of code, breaking pre-formatted code's
>>>> alignment. Cutting and pasting works for the standard python shell,
>>>> but then you lose all the advantages of IPython."
>>>>
>>>>
>>>>
>>>> You might use %cpaste in the ipython normal shell to paste without it
>>>> automatically inserting spaces:
>>>>
>>>> In [5]: %cpaste
>>>> Pasting code; enter '--' alone on the line to stop.
>>>> :if 1>0:
>>>> :    print 'hi'
>>>> :--
>>>> hi
>>>>
>>>> Thanks,
>>>>
>>>> Jason
>>>>
>>>> _______________________________________________
>>>> SciPy-User mailing list
>>>> SciPy-User@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>>
>>>
>>> This strikes me as a textbook example of why we need an integrated
>>> formula framework in statsmodels. I'll make a pass through when I get
>>> a chance and see if there are some places where pandas would really
>>> help out.
>>
>> We used to have a formula class in scipy.stats, and nipy
>> (http://nipy.sourceforge.net/nipy/stable/index.html), which I do not
>> follow, also had one (extremely flexible but very hard to comprehend).
>> It is what I argued ages ago was needed for statsmodels. But it needs a
>> community effort because the syntax required serves multiple
>> communities with different annotations and needs. That is also seen
>> from the different approaches taken by the stats packages from S/R,
>> SAS, Genstat (and those are just the ones I have used).
>>
>
> We have held this discussion at _great_ length multiple times on the
> statsmodels list and are in the process of trying to integrate
> Charlton (from Nathaniel) and/or Formula (from Jonathan / NiPy) into
> the statsmodels base.
>
> http://statsmodels.sourceforge.net/dev/roadmap_todo.html#formula-framework
>
> and more recently
>
> https://groups.google.com/group/pystatsmodels/browse_thread/thread/a76ea5de9e96964b/fd85b80ae46c4931?
>
> https://github.com/statsmodels/formula
> https://github.com/statsmodels/charlton
>
> Wes and I made some effort to go through this at SciPy. From where I
> sit, I think it's difficult to disentangle the data structures from
> the formula implementation, or maybe I'd just prefer to finish
> tackling the former because it's much more straightforward. So I'd
> like to first finish the pandas-integration branch that we've started
> and then focus on the formula support. This is on my (our, I hope...)
> immediate long-term goal list. Then I'd like to come back to the
> community and hash out the 'rules of the game' details for formulas
> after we have some code for people to play with, which promises to be
> "fun."
>
> https://github.com/statsmodels/statsmodels/tree/pandas-integration
>
> FWIW, I could also improve the categorical function to be much nicer
> for the given examples (i.e., take a list, drop a reference category),
> but I don't know that it's worth it, because it's really just a
> stop-gap and ideally users shouldn't have to rely on it. Thoughts on
> more stop-gap?
>
> If I understand Chris' concerns, I think pandas + formula will go a
> long way towards bridging the gap between Python and R usability, but
> it's a large effort and there are only a handful (at best) of people
> writing code -- Wes being the only one who's more or less "full time"
> as far as I can tell. The 0.4 statsmodels release should be very
> exciting though, I hope. I'm looking forward to it, at least. Then
> there's only the small problem of building an infrastructure and
> community like CRAN so we can have specialists writing and maintaining
> code...but I hope once all the tools are in place this will seem much
> less daunting. There certainly seems to be the right sentiment for it.
>
> Skipper

Thanks for the info!

Actually it is impossible to "disentangle the data structures from the
formula implementation". You have to make design decisions, for example
in defining factors: R does that in the data frame (as.factor() is not
part of the formula), SAS uses CLASS statements, etc.
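
To make the point concrete, here is a minimal sketch in plain Python of
what "defining a factor" at the data level involves: a column is declared
categorical and expanded into 0/1 indicator columns with one reference
level dropped, before any formula is parsed. The helper name dummy_code
and the sample data are hypothetical, made up for illustration; this is
not the statsmodels categorical function or anything from R or SAS.

```python
# Hypothetical helper for illustration only; not part of statsmodels,
# pandas, R, or SAS.
def dummy_code(values, reference=None):
    """Expand a categorical column into 0/1 indicator columns.

    The reference level is dropped, so each remaining column is a
    contrast against that level.
    """
    levels = sorted(set(values))
    if reference is None:
        # Assumption: default to the first level in sorted order.
        reference = levels[0]
    if reference not in levels:
        raise ValueError("reference level %r not found in data" % (reference,))
    kept = [lev for lev in levels if lev != reference]
    rows = [[1 if v == lev else 0 for lev in kept] for v in values]
    return kept, rows


# Made-up data: the choice to treat this column as a factor is made
# here, on the data, not inside a model formula.
treatment = ["ctrl", "drugA", "drugB", "drugA", "ctrl"]
columns, design = dummy_code(treatment, reference="ctrl")

print(columns)
for row in design:
    print(row)
```

Whichever layer owns this step (the data structure or the formula) has
to know the levels and the reference category, which is why the two
designs are hard to separate.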


Bruce

