[SciPy-User] R vs Python for simple interactive data analysis

Wes McKinney wesmckinn@gmail....
Sat Aug 27 17:06:47 CDT 2011


On Sat, Aug 27, 2011 at 5:03 PM, Jason Grout
<jason-sage@creativetrax.com> wrote:
> On 8/27/11 1:19 PM, Christopher Jordan-Squire wrote:
>> This comparison might be useful to some people, so I stuck it up on a
>> github repo. My overall impression is that R is much stronger for
>> interactive data analysis. Click on the link for more details why,
>> which are summarized in the README file.
>
> From the README:
>
> "In fact, using Python without the IPython qtconsole is practically
> impossible for this sort of cut and paste, interactive analysis.
> The shell IPython doesn't allow it because it automatically adds
> whitespace on multiline bits of code, breaking pre-formatted code's
> alignment. Cutting and pasting works for the standard python shell,
> but then you lose all the advantages of IPython."
>
>
>
> You might use %cpaste in the ipython normal shell to paste without it
> automatically inserting spaces:
>
> In [5]: %cpaste
> Pasting code; enter '--' alone on the line to stop.
> :if 1>0:
> :    print 'hi'
> :--
> hi
>
> Thanks,
>
> Jason
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>

This strikes me as a textbook example of why we need an integrated
formula framework in statsmodels. I'll make a pass through when I get
a chance and see if there are some places where pandas would really
help out. For example, the weighted average by sex and occupation is
what groupby is all about:

import numpy as np
from pandas import DataFrame, Series

hrdf = DataFrame(hrdat)

# note DataFrame allows you to change the dtype of a column!
hrdf['sex'] = np.where(hrdf['sex'] == 1, 'male', 'female')

def compute_stats(group):
    # total weight, and hourly wage averaged with A_ERNLWT as the weights
    sum_weight = group['A_ERNLWT'].sum()
    wave_hrwage = (group['hrwage'] * group['A_ERNLWT']).sum() / sum_weight
    return Series({'sum_weight' : sum_weight,
                   'wave_hrwage' : wave_hrwage})

wocc = hrdf.groupby(['sex', 'occ']).apply(compute_stats)

In [39]: wocc
Out[39]:
            sum_weight  wave_hrwage
female  1   7.669e+05   23.41
        2   1.541e+06   24.39
        3   1.082e+06   10.02
        4   6.996e+05   13.49
        5   1.325e+06   16.28
        8   5.796e+04   20.44
        9   1.277e+05   12.27
        10  1.12e+05    12.44
male    1   7.325e+05   34.96
        2   1.198e+06   29.06
        3   8.283e+05   13.45
        4   5.013e+05   20.48
        5   4.367e+05   14.96
        7   6.484e+05   17.78
        8   4.424e+05   20.39
        9   6.064e+05   17.64
        10  5.256e+05   17.76
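
For anyone who wants to try the groupby pattern without the original
data, here is a minimal self-contained sketch. The six rows are invented
purely for illustration (they are not from hrdat); only the column names
match the code above. It is written against a recent pandas, so the
column selection before apply is an assumption that may not carry over
to older releases.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for hrdat: the values are made up for illustration.
hrdf = pd.DataFrame({
    "sex": [1, 1, 2, 2, 1, 2],
    "occ": [1, 1, 1, 2, 2, 2],
    "hrwage": [10.0, 20.0, 30.0, 12.0, 16.0, 18.0],
    "A_ERNLWT": [1.0, 3.0, 2.0, 1.0, 4.0, 1.0],
})

# Recode the integer flag as a label, as in the code above.
hrdf["sex"] = np.where(hrdf["sex"] == 1, "male", "female")

def compute_stats(group):
    # Total weight in the group, and the weight-averaged hourly wage.
    sum_weight = group["A_ERNLWT"].sum()
    wave_hrwage = (group["hrwage"] * group["A_ERNLWT"]).sum() / sum_weight
    return pd.Series({"sum_weight": sum_weight, "wave_hrwage": wave_hrwage})

# Group on two keys; selecting the value columns first keeps the
# grouping columns out of what compute_stats receives.
wocc = hrdf.groupby(["sex", "occ"])[["hrwage", "A_ERNLWT"]].apply(compute_stats)
print(wocc)
```

The result has one row per (sex, occ) pair, indexed hierarchically, with
the same two columns as the table above.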

(Of course I'm showing off some swank new pandas 0.4 stuff, i.e.
hierarchical indexing and multi-key groupby)
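
To give a feel for what a hierarchical index buys you, here is a small
sketch that rebuilds four rows of the table above by hand (values copied
from the rounded output) and slices them a few ways. The name wocc_small
is just for this example, and the calls shown (loc, xs, unstack) are the
spellings in a recent pandas; treat them as an assumption rather than as
the 0.4 API.

```python
import pandas as pd

# Four rows copied from the printed result, keyed by (sex, occ).
index = pd.MultiIndex.from_tuples(
    [("female", 1), ("female", 2), ("male", 1), ("male", 2)],
    names=["sex", "occ"],
)
wocc_small = pd.DataFrame(
    {
        "sum_weight": [7.669e5, 1.541e6, 7.325e5, 1.198e6],
        "wave_hrwage": [23.41, 24.39, 34.96, 29.06],
    },
    index=index,
)

# All rows for one outer key; the result is indexed by occ alone.
female = wocc_small.loc["female"]

# Cross-section on the inner level; the result is indexed by sex.
occ1 = wocc_small.xs(1, level="occ")

# Pivot the sex level into columns for a side-by-side comparison.
wide = wocc_small["wave_hrwage"].unstack("sex")

print(female)
print(occ1)
print(wide)
```

The unstacked form puts occupations on the rows and female/male on the
columns, which is handy for eyeballing the wage gap per occupation.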

