[SciPy-user] predicting values based on (linear) models
Thu Jan 15 11:36:43 CST 2009
>> There are different reasons for a lack of user base. One of the reasons
>> for R is that many, many statistics classes use it.
>> Some of the reasons that I do not use scipy for stats (and have not
>> looked at this in some time) included:
>> 1) The difficulty of installation, which is considerably better now.
>> 2) Lack of support for missing values as virtually everything that I
>> have worked with involves missing values at some stage.
>> 3) Lack of a suitable statistical modeling interface where you can
>> specify the model to be fit without having to create each individual
>> array. The approach must work for a range of scenarios.
> With 2 and 3 I have little experience.
> Missing observations, I usually remove or clean in the initial data
> preparation. mstats provides functions for masked arrays, but stats
> mostly assumes no missing values. What would be the generic treatment
> for missing observations, just dropping all observations that have
> NaNs, or converting them to masked arrays and expanding the functions
> so that they can handle those?
No! We have had considerable discussion on this aspect in the past on
the numpy/scipy lists. Basically, a missing observation should not be
treated as a NaN (and there are different types of NaNs) because the two
are not the same. In some cases, missing values disappear in the
calculations, such as when creating the X'X matrix, but you probably do
not want that to happen to real NaNs in your data (say, after taking the
square root of an array that includes negative numbers).
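
To illustrate the distinction, here is a minimal sketch using numpy masked arrays. The data values and the names (raw, missing, roots) are made up for the example; the mask marks an observation that was never recorded, while the NaN comes from a computation on a value that was recorded.

```python
import numpy as np

# Four observations; the last one was never recorded (0.0 is only a placeholder).
raw = np.array([4.0, -1.0, 9.0, 0.0])
missing = np.array([False, False, False, True])

# The square root of the negative entry yields a genuine, computed NaN.
with np.errstate(invalid="ignore"):
    roots = np.sqrt(raw)

# The mask records "missing"; NaN stays in the data as "invalid result".
masked_roots = np.ma.masked_array(roots, mask=missing)
computed_nan = np.isnan(masked_roots.data) & ~masked_roots.mask

n_missing = int(masked_roots.mask.sum())
n_nan = int(computed_nan.sum())

# The masked (missing) entry is skipped, but the computed NaN still propagates.
mean_of_roots = masked_roots.mean()
print(n_missing, n_nan, mean_of_roots)
```

The two conditions stay separate: the missing observation can be ignored quietly, while the computed NaN still shows up in the result and flags a problem in the data.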
> Jonathan Taylor included a formula framework in stats.models similar
> to R, but I haven't looked very closely at it. I haven't learned much
> of R's syntax and I usually prefer to build my own arrays (with some
> exceptions such as polynomials) than hide them behind a mini model
> language.
> For both stats.models and for the interface for general stats
> functions, feedback would be very appreciated.
If you look at R's lm function you can see that you can fit a model
using a formula. Without a similar framework, you cannot do useful
stats. Also you must have a 'mini model language' because the inputs
must be created correctly and it gets very repetitive very quickly.
For example, in R (and all major stats languages like SAS) you can just
fit regression models like lm(Y~ x2) and lm( Y~ x3 + x1), where Y, x1,
x2, and x3 are in the appropriate dataframe (not necessarily in that
If I understand mstats.linregress correctly, I have to create two arrays
just to fit one of these two models. In the second case, I have to
create yet another array. If I have my original data in one array, now I
have unnecessarily duplicated 3 columns of that array not to mention had
to do all this extra work, hopefully error free, just to do 2 lines of R
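
As a rough sketch of what such an interface could look like on the Python side: the helper below, fit_formula, is hypothetical (it is not part of scipy or stats.models) and only understands formulas of the form "Y ~ a + b". It looks the columns up by name in a numpy structured array and fits by ordinary least squares with np.linalg.lstsq. The data values are invented so that Y = 1 + 2*x1 + 3*x3 exactly.

```python
import numpy as np

def fit_formula(formula, data):
    """Hypothetical helper: fit 'Y ~ a + b' by ordinary least squares.

    `data` is a numpy structured array; columns are looked up by name,
    so the caller never builds the design matrix by hand.
    """
    response, rhs = (part.strip() for part in formula.split("~"))
    terms = [term.strip() for term in rhs.split("+")]
    y = np.asarray(data[response], dtype=float)
    # Design matrix: an intercept column followed by the named columns.
    columns = [np.ones(len(y))] + [np.asarray(data[t], dtype=float) for t in terms]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["Intercept"] + terms, coef))

# Made-up data in one array, with Y = 1 + 2*x1 + 3*x3.
data = np.array(
    [
        (1.0, 2.0, 0.5, 4.5),
        (2.0, 1.0, 1.5, 9.5),
        (3.0, 4.0, 1.0, 10.0),
        (4.0, 3.0, 3.0, 18.0),
        (5.0, 6.0, 2.0, 17.0),
    ],
    dtype=[("x1", float), ("x2", float), ("x3", float), ("Y", float)],
)

fit_a = fit_formula("Y ~ x2", data)
fit_b = fit_formula("Y ~ x3 + x1", data)
print(fit_a)
print(fit_b)
```

Both models are fit from the same array with one line each, and the column order in the array does not matter because the columns are selected by name.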
Jonathan's formula framework takes the right approach but, based on the
docstring, it is rather cumbersome and does not use array inputs. It probably
would be more effective with a record masked array.
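
To make the record masked array idea concrete, here is a small sketch using numpy.ma with a structured dtype. The field names, the values and the complete-case handling (dropping any row with a missing field before fitting) are assumptions for illustration only, not how stats.models works.

```python
import numpy as np

# Made-up records with named fields; Y = 1 + 2*x1 where Y was observed.
dtype = [("Y", float), ("x1", float)]
values = np.array([(3.0, 1.0), (5.0, 2.0), (0.0, 3.0), (9.0, 4.0)], dtype=dtype)

# Per-field mask: Y is missing in the third record (its 0.0 is a placeholder).
mask = np.array(
    [(False, False), (False, False), (True, False), (False, False)],
    dtype=[("Y", bool), ("x1", bool)],
)
records = np.ma.masked_array(values, mask=mask)

# Selecting a field by name keeps that field's mask.
y = records["Y"]
x = records["x1"]

# Keep only the rows in which every field used by the model was observed.
complete = ~(np.ma.getmaskarray(y) | np.ma.getmaskarray(x))

X = np.column_stack([np.ones(int(complete.sum())), x.data[complete]])
coef, *_ = np.linalg.lstsq(X, y.data[complete], rcond=None)
print(complete, coef)
```

The original array is never copied column by column; the mask travels with the named fields, and the model code decides how to treat the missing rows.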
PS: Way back when, I did give feedback on the direction of the stats stuff.