[SciPy-User] R vs Python for simple interactive data analysis

Wes McKinney wesmckinn@gmail....
Sat Aug 27 18:32:22 CDT 2011


On Sat, Aug 27, 2011 at 7:30 PM, Christopher Jordan-Squire
<cjordan1@uw.edu> wrote:
> On Sat, Aug 27, 2011 at 6:06 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>> On Sat, Aug 27, 2011 at 5:03 PM, Jason Grout
>> <jason-sage@creativetrax.com> wrote:
>>> On 8/27/11 1:19 PM, Christopher Jordan-Squire wrote:
>>>> This comparison might be useful to some people, so I stuck it up on a
>>>> github repo. My overall impression is that R is much stronger for
>>>> interactive data analysis. Click on the link for more details why,
>>>> which are summarized in the README file.
>>>
>>> From the README:
>>>
>>> "In fact, using Python without the IPython qtconsole is practically
>>> impossible for this sort of cut and paste, interactive analysis.
>>> The shell IPython doesn't allow it because it automatically adds
>>> whitespace on multiline bits of code, breaking pre-formatted code's
>>> alignment. Cutting and pasting works for the standard python shell,
>>> but then you lose all the advantages of IPython."
>>>
>>>
>>>
>>> You might use %cpaste in the ipython normal shell to paste without it
>>> automatically inserting spaces:
>>>
>>> In [5]: %cpaste
>>> Pasting code; enter '--' alone on the line to stop.
>>> :if 1>0:
>>> :    print 'hi'
>>> :--
>>> hi
>>>
>>> Thanks,
>>>
>>> Jason
>>>
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>>
>> This strikes me as a textbook example of why we need an integrated
>> formula framework in statsmodels. I'll make a pass through when I get
>> a chance and see if there are some places where pandas would really
>> help out. For example, the weighted average by sex and occupation is
>> what groupby is all about:
>>
>> import numpy as np
>> from pandas import DataFrame, Series
>>
>> hrdf = DataFrame(hrdat)
>>
>> # note DataFrame allows you to change the dtype of a column!
>> hrdf['sex'] = np.where(hrdf['sex'] == 1, 'male', 'female')
>>
>> def compute_stats(group):
>>     # weighted mean of hrwage within the group, using A_ERNLWT as weights
>>     sum_weight = group['A_ERNLWT'].sum()
>>     wave_hrwage = (group['hrwage'] * group['A_ERNLWT']).sum() / sum_weight
>>     return Series({'sum_weight' : sum_weight,
>>                    'wave_hrwage' : wave_hrwage})
>>
>> wocc = hrdf.groupby(['sex', 'occ']).apply(compute_stats)
>>
>> In [39]: wocc
>> Out[39]:
>>            sum_weight  wave_hrwage
>> female  1   7.669e+05   23.41
>>        2   1.541e+06   24.39
>>        3   1.082e+06   10.02
>>        4   6.996e+05   13.49
>>        5   1.325e+06   16.28
>>        8   5.796e+04   20.44
>>        9   1.277e+05   12.27
>>        10  1.12e+05    12.44
>> male    1   7.325e+05   34.96
>>        2   1.198e+06   29.06
>>        3   8.283e+05   13.45
>>        4   5.013e+05   20.48
>>        5   4.367e+05   14.96
>>        7   6.484e+05   17.78
>>        8   4.424e+05   20.39
>>        9   6.064e+05   17.64
>>        10  5.256e+05   17.76
>>
>> (Of course I'm showing off some swank new pandas 0.4 stuff, i.e.
>> hierarchical indexing and multi-key groupby)
>>
>
> Nifty! I will have to look at these parts of pandas closer.
>
> -Chris JS

I am working hard on documentation for all the new stuff, but I am but
one person :) I hope to have the docs (pandas.sourceforge.net, under
heavy construction at the moment) in more complete shape within a
week.
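
For anyone who wants to try the quoted groupby example without the
original data, here is a minimal self-contained sketch. The values in
the frame are invented stand-ins for hrdat (only the column names come
from the example above), and it is written against a recent pandas
rather than 0.4, so treat it as an illustration, not as the exact code
from the comparison:

```python
import numpy as np
import pandas as pd

# Invented stand-in for hrdat: the numbers are made up for illustration,
# only the column names follow the example in the thread.
hrdf = pd.DataFrame({
    "sex": [1, 1, 2, 2, 1, 2],
    "occ": [1, 1, 1, 2, 2, 2],
    "hrwage": [30.0, 40.0, 20.0, 25.0, 10.0, 15.0],
    "A_ERNLWT": [100.0, 300.0, 200.0, 50.0, 400.0, 150.0],
})

# Recode the integer sex column as strings, as in the thread.
hrdf["sex"] = np.where(hrdf["sex"] == 1, "male", "female")


def compute_stats(group):
    """Total weight and weighted mean hourly wage for one group."""
    sum_weight = group["A_ERNLWT"].sum()
    wave_hrwage = (group["hrwage"] * group["A_ERNLWT"]).sum() / sum_weight
    return pd.Series({"sum_weight": sum_weight, "wave_hrwage": wave_hrwage})


# Multi-key groupby; selecting the value columns first keeps the grouping
# keys out of the frames handed to compute_stats.
wocc = hrdf.groupby(["sex", "occ"])[["hrwage", "A_ERNLWT"]].apply(compute_stats)

# The result is hierarchically indexed by (sex, occ).
male_occ1_wage = wocc.loc[("male", 1), "wave_hrwage"]  # one cell
female_rows = wocc.loc["female"]                       # all occ rows for one sex

print(wocc)
```

Indexing with a tuple picks out a single (sex, occ) row, while indexing
with just the outer key returns the sub-frame for that sex, indexed by
occ.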

- Wes

