# [SciPy-User] ancova with optimize.curve_fit

Skipper Seabold jsseabold@gmail....
Tue Dec 7 12:06:21 CST 2010

```On Tue, Dec 7, 2010 at 1:00 PM, Sebastian Haase <seb.haase@gmail.com> wrote:
> On Tue, Dec 7, 2010 at 5:35 PM,  <josef.pktd@gmail.com> wrote:
>> On Tue, Dec 7, 2010 at 4:58 PM, Bruce Southey <bsouthey@gmail.com> wrote:
>>> On 12/06/2010 10:00 PM, Peter Tittmann wrote:
>>>
>>> Gentlemen,
>>> I've decided to switch to the OLS method, thought I did get the NNLS method
>>> that Skipper proposed working. I was not prepared to spend more time trying
>>> to make sense of the resulting array for ancova, etc. (also could not figure
>>> out how to interpret the resulting coefficient array as I was expecting a 2d
>>> array representing the a and b coefficient values but it returned a 1d
>>> array). I have hopefully simple follow up questions:
>>> 1. Is there a method to define explicitly the function used in OLS? I know
>>> numpy.linalg.lstsq is the way OLS works but is there another function where
>>> I can define the form?
>>> 2. I'm still interested in interpreting the results of the NNLS method, so
>>> if either of you can suggest what the resulting arrays mean id be grateful.
>>> I've attached the output of NNLS
>>> warm regards,
>>> Peter
>>>
>>> Here is the working version of NNLS:
>>> def getDiam2(ht,*b):
>>>     return b[0] * ht[:,1]**b[1] + np.sum(b[2:]*ht[:,2:], axis=1)
>>> dt = np.genfromtxt('/home/peter/Desktop/db_out.csv', delimiter=",",
>>> names=True, dtype=None)
>>>
>>> plot_dummies, col_map = sm.tools.categorical(dt['plot], drop=True,
>>> dictnames=True)
>>>
>>> def nnlsDummies():
>>>     '''this function returns coefficients and covariance arrays'''
>>>     plot_dummies, col_map = sm.tools.categorical(indPlot, drop=True,
>>> dictnames=True)
>>>     X = np.column_stack((indHt, plot_dummies))
>>>     Y = depDbh
>>>     coefs, cov = curve_fit(getDiam2, X, Y, p0= [0.]*X.shape[1])
>>>     return coefs, cov
>>>
>>> --
>>> Peter Tittmann
>>>
>>> On Monday, December 6, 2010 at 4:55 PM, josef.pktd@gmail.com wrote:
>>>
>>> On Mon, Dec 6, 2010 at 7:41 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
>>>
>>> On Mon, Dec 6, 2010 at 7:31 PM, Peter Tittmann wrote:
>>>> thanks both of you,
>>>> Josef, the data that I sent is only the first 100 rows of about 1500,
>>>> there
>>>> should be sufficient sampling in each plot.
>>>> Skipper, I have attempted to deploy your suggestion for not linearizing
>>>> the
>>>> data. It seems to work. I'm a little confused at your modification if the
>>>> getDiam function and I wonder if you could help me understand. The form of
>>>> the equation that is being fit is:
>>>> Y= a*X^b
>>>> your version of the detDaim function:
>>>>
>>>> def getDiam(ht, *b):
>>>>    return ht[:,0]**b[0] + np.sum(b[1:]*ht[:,1:], axis=1)
>>>>
>>>> Im sorry if this is an obvious question but I don't understand how this
>>>> works as it seems that the "a" coefficient is missing.
>>>> Thanks again!
>>>
>>> Right.  I took out the 'a', because as I read it when I linearized (I
>>> might be misunderstanding ancova, I never recall the details), if you
>>> include 'a' and also all of the dummy variables for the plot, then you
>>> will have a the problem of multicollinearity.  You could also include
>>> 'a' and drop one of the plot dummies, but then 'a' is just your
>>> reference category that you dropped.  So now b[0] is the nonlinear
>>> effect of your main variable and b[1:] contains linear shift effects
>>> of all the plots.  Hmm, thinking about it some more, though I think
>>> you could include 'a' in the non-linear version above (call it b[0]
>>> and shift everything else over by one), because now 'a' would be the
>>> effect when the current b[0] is zero.  I was just unsure how you meant
>>> 'a' when you had a*ht**b and were trying to include in ht the plot
>>> variable dummies.
>>>
>>> As I understand it, the intention is to estimate equality of the slope
>>> coefficients, so the continuous variable is multiplied with the dummy
>>> variables. In this case, the constant should still be added. The
>>> normalization question is whether to include all dummy-cont.variable
>>> products and drop the continuous variable, or include the continuous
>>> variable and drop one of the dummy-cont levels.
>>>
>>> Unless there is a strong reason to avoid log-normality of errors, I
>>> would work (first) with the linear version.
>>>
>>> Josef
>>>
>>>
>>>
>>> Skipper
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>>>
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>>> I do think this is starting to be an off-list discussion because this is
>>> really about statistics and not about numpy/scipy (you can contact me
>>> off-list if you want).
>>
>> If it's too much stats, we can continue on the statsmodels list. Last
>> time there was a similar question, I programmed most of the tests for
>> the linear case, and it should soon be possible to do it for the
>> non-linear case also.
>> The target is to be able to do this in 5 (or so) lines of code for the
>> linear case and maybe 10 lines for the non-linear case.
>>
>> Josef
>>
>>>
>>> I am not sure what all the variables are so please excuse me but I presume
>>> you want to model dbh as a function of height, plot and species. Following
>>> usual biostatistics interpretation, 'plot' is probably treated as random
>>> effect but you probably have to use R/SAS etc for that for both linear and
>>> nonlinear models or some spatial models.
>>>
>>> Really you need to determine whether or not a nonlinear model is required.
>>> With the few data points you provided, I only see a linear relationship
>>> between dbh and height with some outliers and perhaps some heterogeneity of
>>> variance. Often doing a simple polynomial/spline can help to see if there is
>>> any evidence for a nonlinear relationship in the full data - a linear model
>>> or polynomial with the data provided does not suggest a nonlinear model.
>>> Obviously a linear model is easier to fit and interpret especially if you
>>> create the design matrix as estimable functions (which is rather trivial
>>> once you understand using dummy variables).
>>>
>>> The most general nonlinear/multilevel model proposed is of the form:
>>> dbh= C + A*height^B
>>> Obviously if B=1 then it is a linear model and the parameters A, B and C can
>>> be modeled with a linear function of intercept, plot and species. Although,
>>> if 'plot' is what I think it is then you probably would not model the
>>> parameters A and B with it.
>>>
>>> Without C you are forcing the curve through zero which is biological
>>> feasible if you expect dbh=0 when height is zero. However, dbh can be zero
>>> if height is not zero just due to the model itself or what dbh actually is
>>> (it may take a minimum height before dbh is greater than zero). With the
>>> data you provided, there are noticeable differences between species for dbh
>>> and height so you probably need C in your model.
>>>
>>> For this general model you probably should just fit the curve for each
>>> species alone but I would use a general stats package to do this. This will
>>> give you a good starting point to know how well the curve fits each species
>>> as well as the similarity of parameters and residual variation. Getting
>>> convergence with a model that has B varying across species may be rather
>>> hard so I would suggest modeling A and C first.
>>>
>>> Bruce
>>>
> I have never heard of a "statsmodels lists" -- where and what is that !?
> Thanks,
> - Sebastian Haase