# [SciPy-User] ancova with optimize.curve_fit

josef.pktd@gmai... josef.pktd@gmai...
Tue Dec 7 10:35:38 CST 2010

```On Tue, Dec 7, 2010 at 4:58 PM, Bruce Southey <bsouthey@gmail.com> wrote:
> On 12/06/2010 10:00 PM, Peter Tittmann wrote:
>
> Gentlemen,
> I've decided to switch to the OLS method, thought I did get the NNLS method
> that Skipper proposed working. I was not prepared to spend more time trying
> to make sense of the resulting array for ancova, etc. (also could not figure
> out how to interpret the resulting coefficient array as I was expecting a 2d
> array representing the a and b coefficient values but it returned a 1d
> array). I have hopefully simple follow up questions:
> 1. Is there a method to define explicitly the function used in OLS? I know
> numpy.linalg.lstsq is the way OLS works but is there another function where
> I can define the form?
> 2. I'm still interested in interpreting the results of the NNLS method, so
> if either of you can suggest what the resulting arrays mean id be grateful.
> I've attached the output of NNLS
> warm regards,
> Peter
>
> Here is the working version of NNLS:
> def getDiam2(ht,*b):
>     return b[0] * ht[:,1]**b[1] + np.sum(b[2:]*ht[:,2:], axis=1)
> dt = np.genfromtxt('/home/peter/Desktop/db_out.csv', delimiter=",",
> names=True, dtype=None)
>
> plot_dummies, col_map = sm.tools.categorical(dt['plot], drop=True,
> dictnames=True)
>
> def nnlsDummies():
>     '''this function returns coefficients and covariance arrays'''
>     plot_dummies, col_map = sm.tools.categorical(indPlot, drop=True,
> dictnames=True)
>     X = np.column_stack((indHt, plot_dummies))
>     Y = depDbh
>     coefs, cov = curve_fit(getDiam2, X, Y, p0= [0.]*X.shape[1])
>     return coefs, cov
>
> --
> Peter Tittmann
>
> On Monday, December 6, 2010 at 4:55 PM, josef.pktd@gmail.com wrote:
>
> On Mon, Dec 6, 2010 at 7:41 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
>
> On Mon, Dec 6, 2010 at 7:31 PM, Peter Tittmann wrote:
>> thanks both of you,
>> Josef, the data that I sent is only the first 100 rows of about 1500,
>> there
>> should be sufficient sampling in each plot.
>> Skipper, I have attempted to deploy your suggestion for not linearizing
>> the
>> data. It seems to work. I'm a little confused at your modification if the
>> getDiam function and I wonder if you could help me understand. The form of
>> the equation that is being fit is:
>> Y= a*X^b
>> your version of the detDaim function:
>>
>> def getDiam(ht, *b):
>>    return ht[:,0]**b[0] + np.sum(b[1:]*ht[:,1:], axis=1)
>>
>> Im sorry if this is an obvious question but I don't understand how this
>> works as it seems that the "a" coefficient is missing.
>> Thanks again!
>
> Right.  I took out the 'a', because as I read it when I linearized (I
> might be misunderstanding ancova, I never recall the details), if you
> include 'a' and also all of the dummy variables for the plot, then you
> will have a the problem of multicollinearity.  You could also include
> 'a' and drop one of the plot dummies, but then 'a' is just your
> reference category that you dropped.  So now b[0] is the nonlinear
> effect of your main variable and b[1:] contains linear shift effects
> of all the plots.  Hmm, thinking about it some more, though I think
> you could include 'a' in the non-linear version above (call it b[0]
> and shift everything else over by one), because now 'a' would be the
> effect when the current b[0] is zero.  I was just unsure how you meant
> 'a' when you had a*ht**b and were trying to include in ht the plot
> variable dummies.
>
> As I understand it, the intention is to estimate equality of the slope
> coefficients, so the continuous variable is multiplied with the dummy
> variables. In this case, the constant should still be added. The
> normalization question is whether to include all dummy-cont.variable
> products and drop the continuous variable, or include the continuous
> variable and drop one of the dummy-cont levels.
>
> Unless there is a strong reason to avoid log-normality of errors, I
> would work (first) with the linear version.
>
> Josef
>
>
>
> Skipper
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
> I do think this is starting to be an off-list discussion because this is
> really about statistics and not about numpy/scipy (you can contact me
> off-list if you want).

If it's too much stats, we can continue on the statsmodels list. Last
time there was a similar question, I programmed most of the tests for
the linear case, and it should soon be possible to do it for the
non-linear case also.
The target is to be able to do this in 5 (or so) lines of code for the
linear case and maybe 10 lines for the non-linear case.

Josef

>
> I am not sure what all the variables are so please excuse me but I presume
> you want to model dbh as a function of height, plot and species. Following
> usual biostatistics interpretation, 'plot' is probably treated as random
> effect but you probably have to use R/SAS etc for that for both linear and
> nonlinear models or some spatial models.
>
> Really you need to determine whether or not a nonlinear model is required.
> With the few data points you provided, I only see a linear relationship
> between dbh and height with some outliers and perhaps some heterogeneity of
> variance. Often doing a simple polynomial/spline can help to see if there is
> any evidence for a nonlinear relationship in the full data - a linear model
> or polynomial with the data provided does not suggest a nonlinear model.
> Obviously a linear model is easier to fit and interpret especially if you
> create the design matrix as estimable functions (which is rather trivial
> once you understand using dummy variables).
>
> The most general nonlinear/multilevel model proposed is of the form:
> dbh= C + A*height^B
> Obviously if B=1 then it is a linear model and the parameters A, B and C can
> be modeled with a linear function of intercept, plot and species. Although,
> if 'plot' is what I think it is then you probably would not model the
> parameters A and B with it.
>
> Without C you are forcing the curve through zero which is biological
> feasible if you expect dbh=0 when height is zero. However, dbh can be zero
> if height is not zero just due to the model itself or what dbh actually is
> (it may take a minimum height before dbh is greater than zero). With the
> data you provided, there are noticeable differences between species for dbh
> and height so you probably need C in your model.
>
> For this general model you probably should just fit the curve for each
> species alone but I would use a general stats package to do this. This will
> give you a good starting point to know how well the curve fits each species
> as well as the similarity of parameters and residual variation. Getting
> convergence with a model that has B varying across species may be rather
> hard so I would suggest modeling A and C first.
>
> Bruce
>
>
>
>
>
>
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>
```