[SciPy-User] [ANN] scikit.statsmodels 0.2.0 release
Fri Feb 19 11:45:04 CST 2010
On Fri, Feb 19, 2010 at 12:19 PM, Gael Varoquaux
> On Fri, Feb 19, 2010 at 10:57:01AM -0600, Bruce Southey wrote:
>> Will it end up as cython?
> I am trying to convince the engineer who is doing the work to go down
> that way but he doesn't like cython. I am hesitant to impose my point of
> view on a highly qualified engineer, but I don't like having this
> hand-written C binding, I must admit.
>> (I just used the supplied Python bindings of libsvm so this could be
> Well, we provide much more, like access to the weights, or vectorized
> predict :).
>> > Let's say that the focus of scikit.learn and statsmodels is most
>> > probably going to be slightly different.
>> Having done both (with papers), I find this type of comment assuming
>> because the same concepts underlie both. What I would like to avoid
>> is having different user syntax for the same basic model. For
>> example, with logistic regression in SAS you have to be careful of which
>> is the default event setting as it varies across procedures. At least
>> these SAS procedures use the same unmodified dataset unlike some of the
>> R packages that do lars/lasso.
> Indeed, I agree. We'll try to look very closely at statsmodels and not
> differ if we can. However, (rant ahead), we hear this story everywhere we
> go: match our API. So we are struggling between pymvpa, mdp and statsmodels
> (I am probably forgetting a few) that all differ slightly. We are willing
> to adapt as long as it is not damaging for our use cases, but it would be
> nice to have a common discussion.
> Also, there will be differences between the APIs, as far as I understand
> the statsmodels API. For instance, I believe that constructors of models
> should work without being passed the data (the data could be optional). The
> reason being that on-line estimators shouldn't be passed data at
> initialization. As a consequence, maybe the 'fit' method should
> take the data... All this is quite open to me, and I don't want to draw
> any premature conclusion.
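The constructor-without-data idea in the quoted paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `OnlineSimpleRegression` and its methods `fit`, `partial_fit` and `coefficients` are hypothetical names, not part of scikit.learn or statsmodels, and the estimator is a one-feature least-squares fit kept in pure Python for brevity.

```python
class OnlineSimpleRegression:
    """Hypothetical one-feature least-squares estimator built without data."""

    def __init__(self, fit_intercept=True):
        # Only configuration is given here; no data is needed to create the model.
        self.fit_intercept = fit_intercept
        self._reset()

    def _reset(self):
        # Running sums are the only state an on-line update needs.
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def partial_fit(self, xs, ys):
        """Fold one more batch of observations into the running sums."""
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y
        return self

    def fit(self, xs, ys):
        """Batch fit: forget earlier data, then consume the given data."""
        self._reset()
        return self.partial_fit(xs, ys)

    def coefficients(self):
        """Return (intercept, slope) from the accumulated sums."""
        if self.n == 0:
            raise ValueError("no data seen yet")
        if self.fit_intercept:
            denom = self.n * self.sxx - self.sx ** 2
            if denom == 0:
                raise ValueError("x values are constant; slope is undefined")
            slope = (self.n * self.sxy - self.sx * self.sy) / denom
            intercept = (self.sy - slope * self.sx) / self.n
        else:
            if self.sxx == 0:
                raise ValueError("all x values are zero; slope is undefined")
            slope = self.sxy / self.sxx
            intercept = 0.0
        return intercept, slope


# The model exists before any data does; data arrives in two batches.
model = OnlineSimpleRegression()
model.partial_fit([0.0, 1.0], [1.0, 3.0])
model.partial_fit([2.0, 3.0], [5.0, 7.0])
intercept, slope = model.coefficients()
print(intercept, slope)
```

With this layout, a batch user calls `fit(xs, ys)` once, while an on-line user calls `partial_fit` repeatedly on the same object, so neither case needs data at construction time.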
Just a quick comment (disclaimer: all my own thoughts and
misunderstandings...feel free to correct me). Historically, the
statsmodels package accepted a design at model instantiation, and
then you passed your dependent variable to the fit method. To my
mind, though, this didn't seem to make much sense for how I think of a
model (probably somewhat discipline specific?). For the estimators
that we have we are usually fitting a parametric model in order to
test a given theory about the data generating process. The model
doesn't make much sense to me without the data (my data is not
real-time and I am not data mining). Again though I want to make the
package as useful to others as possible (without alienating those who
think of models as I do), so of course suggestions on how to improve
the API or make it more general are more than welcome.
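For comparison, the historical pattern described above (design at instantiation, dependent variable at fit time) might look roughly like the sketch below. The class `DesignFirstOLS` and its `params` attribute are hypothetical names chosen for illustration and are not the actual statsmodels API; the fit itself is assumed to be ordinary least squares via `numpy.linalg.lstsq`.

```python
import numpy as np


class DesignFirstOLS:
    """Hypothetical model: the design is bound at construction, y is given to fit."""

    def __init__(self, design):
        # The design (one row per observation) is fixed when the model is created.
        self.design = np.asarray(design, dtype=float)
        self.params = None

    def fit(self, y):
        """Estimate least-squares coefficients for the dependent variable y."""
        y = np.asarray(y, dtype=float)
        if y.shape[0] != self.design.shape[0]:
            raise ValueError("y must have one entry per row of the design")
        # lstsq returns (solution, residuals, rank, singular values).
        self.params, _, _, _ = np.linalg.lstsq(self.design, y, rcond=None)
        return self


# A column of ones (intercept) plus one regressor.
design = np.column_stack([np.ones(4), np.arange(4.0)])
model = DesignFirstOLS(design)
model.fit([1.0, 3.0, 5.0, 7.0])
print(model.params)
```

In this layout one design can be reused with several dependent variables, but the model object cannot exist before the data does, which is the tension with on-line estimators raised in the quoted message.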
> We have not done any API design so far, because we are trying to
> get a feel for what the existing APIs are, and because we want to have
> working code to throw use cases at. Also, we are extremely open to
> comments, just subscribe to the scikit.learn mailing list (not everybody
> involved with scikit learn follows this high-traffic mailing list).
>> >> What would be nice is the acceptance of input data types between learn
>> >> and statsmodels especially for things like logistic regression. While I
>> >> understand the need for duplicate functions, it may be desirable to share
>> >> at least code since both code bases are still relatively 'new'.
>> > Well, as far as I am concerned, data types are numpy arrays. I am wary
>> > of implementing higher level abstractions. It's more the APIs that may
>> > differ, and that we will have to keep in sync.
>> I do agree especially now that I have learnt the 'array' approach of
>> doing things.
>> In some way my view of integration of things is Zelig -not that I have
>> really looked at it (as it is in R) :
> Well, let us try not to have to build common API and integration a
> posteriori, but build it right from the start. A bit of API work is well
> worth the effort, I believe. And please feel free to pitch in.
To my mind, the burden is probably more on statsmodels to provide an
interface to the learn code, as we would be more likely to take
advantage of your routines.
>> The seamless ability to link packages is rather appealing and both
>> scikits share at least numpy.
> And scipy, I believe.