[SciPy-User] R vs Python for simple interactive data analysis

Christopher Jordan-Squire cjordan1@uw....
Sat Aug 27 18:30:46 CDT 2011


On Sat, Aug 27, 2011 at 6:06 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
> On Sat, Aug 27, 2011 at 5:03 PM, Jason Grout
> <jason-sage@creativetrax.com> wrote:
>> On 8/27/11 1:19 PM, Christopher Jordan-Squire wrote:
>>> This comparison might be useful to some people, so I stuck it up on a
>>> github repo. My overall impression is that R is much stronger for
>>> interactive data analysis. Click on the link for more details why,
>>> which are summarized in the README file.
>>
>> From the README:
>>
>> "In fact, using Python without the IPython qtconsole is practically
>> impossible for this sort of cut and paste, interactive analysis.
>> The shell IPython doesn't allow it because it automatically adds
>> whitespace on multiline bits of code, breaking pre-formatted code's
>> alignment. Cutting and pasting works for the standard python shell,
>> but then you lose all the advantages of IPython."
>>
>>
>>
>> You might use %cpaste in the ipython normal shell to paste without it
>> automatically inserting spaces:
>>
>> In [5]: %cpaste
>> Pasting code; enter '--' alone on the line to stop.
>> :if 1>0:
>> :    print 'hi'
>> :--
>> hi
>>
>> Thanks,
>>
>> Jason
>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
>
> This strikes me as a textbook example of why we need an integrated
> formula framework in statsmodels. I'll make a pass through when I get
> a chance and see if there are some places where pandas would really
> help out. For example, the weighted average by sex and occupation is
> what groupby is all about:
>
> import numpy as np
> from pandas import DataFrame, Series
>
> # hrdat is defined elsewhere (not shown in this message)
> hrdf = DataFrame(hrdat)
>
> # note DataFrame allows you to change the dtype of a column!
> hrdf['sex'] = np.where(hrdf['sex'] == 1, 'male', 'female')
>
> def compute_stats(group):
>     # total weight and weighted mean hourly wage for one (sex, occ) group
>     sum_weight = group['A_ERNLWT'].sum()
>     wave_hrwage = (group['hrwage'] * group['A_ERNLWT']).sum() / sum_weight
>     return Series({'sum_weight' : sum_weight,
>                    'wave_hrwage' : wave_hrwage})
>
> wocc = hrdf.groupby(['sex', 'occ']).apply(compute_stats)
>
> In [39]: wocc
> Out[39]:
>            sum_weight  wave_hrwage
> female  1   7.669e+05   23.41
>        2   1.541e+06   24.39
>        3   1.082e+06   10.02
>        4   6.996e+05   13.49
>        5   1.325e+06   16.28
>        8   5.796e+04   20.44
>        9   1.277e+05   12.27
>        10  1.12e+05    12.44
> male    1   7.325e+05   34.96
>        2   1.198e+06   29.06
>        3   8.283e+05   13.45
>        4   5.013e+05   20.48
>        5   4.367e+05   14.96
>        7   6.484e+05   17.78
>        8   4.424e+05   20.39
>        9   6.064e+05   17.64
>        10  5.256e+05   17.76
>
> (Of course I'm showing off some swank new pandas 0.4 stuff, i.e.
> hierarchical indexing and multi-key groupby)
>

Nifty! I will have to look at these parts of pandas more closely.
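
For anyone who wants to try the groupby approach without the data file, here
is a minimal self-contained sketch. The column names (sex, occ, hrwage,
A_ERNLWT) are taken from the snippet above; the six rows of data are made up
purely for illustration and are not from the real dataset. Importing pandas
as pd and selecting the two numeric columns before calling apply are my own
choices (the selection is there so that recent pandas versions do not pass
the grouping columns into the function), not something from the original
message.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the wage data; only the column names follow the thread.
hrdf = pd.DataFrame({
    "sex": [1, 1, 2, 2, 1, 2],
    "occ": [1, 1, 1, 2, 2, 2],
    "hrwage": [30.0, 40.0, 20.0, 25.0, 15.0, 35.0],
    "A_ERNLWT": [100.0, 300.0, 200.0, 100.0, 50.0, 300.0],
})

# Recode the integer sex codes as labels (assumes 1 means male, as in the snippet).
hrdf["sex"] = np.where(hrdf["sex"] == 1, "male", "female")

def compute_stats(group):
    # Total weight and weighted mean hourly wage for one (sex, occ) group.
    sum_weight = group["A_ERNLWT"].sum()
    wave_hrwage = (group["hrwage"] * group["A_ERNLWT"]).sum() / sum_weight
    return pd.Series({"sum_weight": sum_weight, "wave_hrwage": wave_hrwage})

# Group by both keys; the result is indexed hierarchically by (sex, occ).
wocc = hrdf.groupby(["sex", "occ"])[["hrwage", "A_ERNLWT"]].apply(compute_stats)
print(wocc)
```

Individual groups can then be pulled out with a tuple key, e.g.
wocc.loc[("male", 1)], which is the hierarchical indexing mentioned above.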

-Chris JS


