[SciPy-User] R vs Python for simple interactive data analysis
Christopher Jordan-Squire
cjordan1@uw....
Mon Aug 29 15:55:08 CDT 2011
I've just pushed an updated version of the .r and .py files to github,
as well as a summary of the corrections/suggestions from the mailing
list. I'd appreciate any further comments/suggestions.
Compared to the original .r and .py files, in these revised versions:
-The R code was cleaned up because I realized I didn't need to use
as.factor if I made the relevant variables into factors
-The Python code was cleaned up by computing the 'sub-design matrices'
associated with each factor variable beforehand and stashing
them in a dictionary
-Names were added to the variables in the regression by creating them
from the calls to sm.categorical and stashing them in a dictionary
Notably, the helper functions and the stashing of the pieces of design
matrices simplified the calls for model fitting, but they didn't noticeably
shorten the code. They also required a small increase in complexity, in terms
of the data structures and function calls used to create the list of names
and the design matrices.
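To give a rough idea of the dictionary approach described above, here is a
minimal sketch. It is not the code from the repo: dummy_columns and
build_design are hypothetical helpers written in plain numpy as a stand-in
for the sm.categorical calls, and the data is made up.

```python
import numpy as np

def dummy_columns(values, prefix):
    """Hypothetical helper: one 0/1 column per level, dropping the first
    (sorted) level as the reference category."""
    values = np.asarray(values)
    levels = np.unique(values)  # sorted distinct levels
    # Compare every observation against each non-reference level.
    cols = (values[:, None] == levels[None, 1:]).astype(float)
    names = ["%s[%s]" % (prefix, lev) for lev in levels[1:]]
    return cols, names

# Made-up data, for illustration only.
data = {
    "sex": np.array(["f", "m", "m", "f", "m"]),
    "region": np.array(["n", "s", "e", "n", "e"]),
}

# Compute each factor's sub-design matrix and column names once,
# and stash them in dictionaries keyed by variable name.
design = {}
names = {}
for var in data:
    design[var], names[var] = dummy_columns(data[var], var)

def build_design(factors):
    """Stack an intercept column and the stashed pieces for the given factors."""
    n_obs = len(data[factors[0]])
    X = np.column_stack([np.ones(n_obs)] + [design[f] for f in factors])
    col_names = ["const"] + [nm for f in factors for nm in names[f]]
    return X, col_names

X, col_names = build_design(["sex", "region"])
```

Each model fit then only needs one build_design call with the list of factors,
at the cost of carrying around the two dictionaries.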
I also added some comments to the effect that:
*one can use paste or cpaste in the IPython shell
*np.set_printoptions or sm.iolib.SimpleTable can be used to help with
printing of numpy arrays
*names can be added by the user to regression model summaries
*one can make helper functions to construct design matrices and keep
track of names, but the simplest way of doing it isn't robust to
subsetting the data in the presence of categorical variables
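
The subsetting problem in the last point can be illustrated with a small
sketch. Again, this is plain numpy with hypothetical helpers (dummy_columns,
dummy_columns_fixed) and made-up data, not the code from the repo. If the
levels are inferred from whatever data is passed in, a subset that happens to
lose a level produces a design matrix with fewer columns, so any stored list
of names no longer lines up. One possible workaround is to fix the levels up
front.

```python
import numpy as np

def dummy_columns(values):
    """Hypothetical simple helper: levels are inferred from the data passed in."""
    values = np.asarray(values)
    levels = np.unique(values)
    return (values[:, None] == levels[None, :]).astype(float), list(levels)

def dummy_columns_fixed(values, levels):
    """Hypothetical more robust variant: levels are supplied by the caller."""
    values = np.asarray(values)
    return (values[:, None] == np.asarray(levels)[None, :]).astype(float)

# Made-up data, for illustration only.
region = np.array(["n", "s", "e", "n", "e"])
keep = region != "s"  # a subset that drops every "s" observation

full_cols, full_levels = dummy_columns(region)
sub_cols, sub_levels = dummy_columns(region[keep])

# The subset silently loses a column, so names built from the full data
# no longer match the columns of the subset's design matrix.
print(full_cols.shape, sub_cols.shape)

# Passing the full list of levels keeps the column layout stable;
# the column for the missing level is simply all zeros.
fixed_cols = dummy_columns_fixed(region[keep], full_levels)
print(fixed_cols.shape)
```

Note that an all-zero column makes the design matrix rank deficient, so in
practice one would still want to drop or flag empty levels before fitting.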
Did I miss anything?
-Chris JS
On Sat, Aug 27, 2011 at 1:19 PM, Christopher Jordan-Squire
<cjordan1@uw.edu> wrote:
> Hi--I've been a moderately heavy R user for the past two years, so
> about a month ago I took an (abbreviated) version of a simple data
> analysis I did in R and tried to rewrite as much of it as possible,
> line by line, into python using numpy and statsmodels. I didn't use
> pandas, and I can't comment on how much it might have simplified
> things.
>
> This comparison might be useful to some people, so I stuck it up on a
> github repo. My overall impression is that R is much stronger for
> interactive data analysis. Click on the link for more details why,
> which are summarized in the README file.
>
> https://github.com/chrisjordansquire/r_vs_py
>
> The code examples should run out of the box with no downloads (other
> than R, Python, numpy, scipy, and statsmodels) required.
>
> -Chris Jordan-Squire
>