# [SciPy-user] Ols for np.arrays and masked arrays

josef.pktd@gmai... josef.pktd@gmai...
Mon Jan 19 14:30:57 CST 2009

```
On Mon, Jan 19, 2009 at 11:53 AM, Bruce Southey <bsouthey@gmail.com> wrote:
> Hi,
> I do not want to sound overly critical as I would like to assist with this.
>
> I am not sure of what your design goal is and what should the Regression
> class actually contain and do.  Do you want something like an R lm object?
>

After our previous discussion, I was thinking about what a more useful
interface for regression analysis could look like (without using a
full formula framework). The ols estimation part was just a quick
example of a regression; the main purpose of this is to find a good
interface.

My basic idea was to create a general regression class that does the
initialization and the common calculations, with the specific
estimation implemented in subclasses, e.g. OLS, GLS, Ridge, ... (?).
For the basic statistics, which are calculated on demand, I'm planning
something like http://www.scipy.org/Cookbook/OLS. I would also like to
get a similar wrapper for non-linear least squares, optimize.leastsq,
because that doesn't produce the (normalized) parameter standard
errors, or for a non-linear maximum likelihood estimator. Also, I
don't want to duplicate work in stats.models.
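
To make the leastsq point concrete, here is a rough sketch of the kind
of wrapper I have in mind (leastsq_se and the residual-function
signature are just placeholders, not existing code): optimize.leastsq
only returns the raw cov_x, which still has to be scaled by the
residual variance to get the usual standard errors.

from scipy import optimize
import numpy as np

def leastsq_se(residfunc, p0, y, x):
    # sketch only: parameter standard errors for an optimize.leastsq fit;
    # residfunc(p, y, x) should return the residual vector
    p, cov_x, infodict, msg, ier = optimize.leastsq(residfunc, p0,
                                            args=(y, x), full_output=True)
    resid = residfunc(p, y, x)
    s_sq = (resid**2).sum() / (len(resid) - len(p))  # residual variance
    se = np.sqrt(np.diag(cov_x * s_sq))  # cov_x is None if the fit is singular
    return p, se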

> But, I do not like that your _init_ function does so much work that
> probably does not belong there. I would have thought that it would just
> initialize certain important variables including the solutions (b). One
> reason is perhaps that you just want to update the input arrays, but this
> design forces you to create a new object.
>
> At what stage should you check for valid inputs and the correct
> dimensions of x?

I think this is the main point of the __init__ function: to transform
the data from a convenient specification for the user to the form that
is used by the estimation procedures. If the design matrix is not
already a "clean" array, then it needs to be copied at least once to be
handed to linalg (as far as I understand this).
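
Roughly, the cleaning step amounts to something like the following
(a sketch of the idea only, not the exact code in the attachment):

import numpy as np
import numpy.ma as ma

def clean_design(y, *args):
    # sketch: mask nans, stack the regressors with a constant column,
    # and drop every observation with a masked value anywhere
    y = ma.fix_invalid(y).reshape(-1, 1)
    design = ma.fix_invalid(ma.column_stack([np.ones(y.shape)] + list(args)))
    valid = ~(ma.getmaskarray(y).ravel() | ma.getmaskarray(design).any(axis=1))
    return y.data[valid], design.data[valid]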

>
> I would also prefer the object having the standard errors of the
> solutions and eventually other 'useful' statistics like sum of squares,
> R-squared. However, I do not know how to get the standard errors using
> linalg.lstsq (I vaguely recall there is another way that would work).
> The others are easy to get, probably on demand, like R's summary function.

My current thinking is that some basic statistics, such as the
estimated parameters, standard errors, p-values and R^2, are returned
immediately. More in-depth analysis of the residuals is returned on
demand in a special result structure (class) (?).
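
By "on demand" I mean something like lazily evaluated attributes; a
rough sketch (the class name and attributes are only placeholders for
the result structure, not code from the attachment):

import numpy as np

class OlsResults(object):
    # placeholder sketch of an on-demand result structure
    def __init__(self, e):
        self.e = e           # residual column vector
        self._dw = None
    @property
    def dw(self):
        # Durbin-Watson statistic, computed only when first requested
        if self._dw is None:
            de = np.diff(self.e, axis=0)
            self._dw = float(np.dot(de.T, de) / np.dot(self.e.T, self.e))
        return self._dw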

Just to see how it works, I copied the cookbook ols recipe into my ols
class; I had to adjust the dimensions (1D to 2D columns). It looks ok,
but I didn't test more than making sure it runs and looks reasonable.
Several of the test functions can go into the regression class, since
they also apply to other regression models, not just OLS.

The current version takes one or several arrays as inputs. Inputs can
be 1D or 2D, and can be either numpy arrays or masked arrays. All
observations with at least one masked or nan value are removed from
the regression. Other array subclasses are not yet included.
Predicted values, called yhat, are masked arrays if the input was
masked or contained nans. If the inputs are plain numpy arrays with
finite values, then the predicted value array, yhat, is also a plain
numpy array. (I haven't checked the copied part yet, e.g. the error
vector is based on the compressed values.)

I attached the updated version. lm in R looks a bit like an umbrella
function, but I have not used it enough to tell what useful features
should be copied by a regression class. Generalized linear models will
be in stats.models.

Josef
-------------- next part --------------
import numpy as np
import numpy.ma as ma
import numpy.random as mtrand #import randn, seed
import matplotlib.pylab as plt
import time
from scipy import linalg, stats
#from numpy.testing import assert_almost_equal
from numpy.ma.testutils import assert_almost_equal

import regressionanalysis_ma2 as _ini

class Regression(object):
    def __init__(self, y, *args, **kwds):
        '''
        Parameters
        ----------
        y : array, 1D or 2D
            dependent variable of the regression. If 2D, it needs to be one column
        args : sequence of 1D or 2D arrays
            one or several arrays of regressors. If 2D, the interpretation
            is that each column represents one variable and rows are
            observations
        addconst : bool, optional (default True)
            if True then a constant is added to the regressors

        Returns
        -------
        class instance with regression results

        Notes
        -----
        Observations (rows) that contain masked values in a masked array or nans
        in any of the regressors or in y will be dropped for the calculation.
        Arrays that correspond to observations, e.g. estimated y (yhat)
        or residuals, are returned as masked arrays if masked values or nans are
        in the input data.

        Usage
        -----
        estm = Ols(y, x1, x2)
        estm.b  # estimated parameter vector

        example polynomial
        estv = Ols(y, np.vander(x[:,0],3), x[:,1], addconst=False)
        estv.b    # estimated parameter vector
        estv.yhat # fitted values

        '''
        self.y_varnm = kwds.pop('y_varnm', 'y')
        self.x_varnm = kwds.pop('x_varnm', [])
        addconst = kwds.pop('addconst', True)   # used by the examples below
        if not isinstance(self.x_varnm, list):
            self.x_varnm = list(self.x_varnm)
        # Make sure that any NaN in y and args are masked, and force masked arrays
        y = ma.fix_invalid(y)
        if y.ndim == 1:
            y.shape = (-1, 1)
        design = list(args)
        #design = [ma.fix_invalid(x) for x in args]
        if addconst:
            # We need a constant: add a column of 1
            design.insert(0, np.ones(y.shape))
            self.x_varnm.insert(0, 'const')

        # stack first then fix; saves one copy, maybe not
        design = ma.column_stack(design)  # convert types and make copy
        #does it make copy if args = x single 2D array
        design = ma.fix_invalid(design, copy=False)

        # observations that are masked (or nan) in y or in any regressor are dropped
        # (reconstruction: the line computing `valid` was lost; this follows the docstring)
        valid = ~(ma.getmaskarray(y).ravel() | ma.getmaskarray(design).any(axis=1))
        self.mask = ~valid    # mask of dropped observations, used by predict

        # And now, drop them
        self.y = y.data[valid, :]
        self.design = design.data[valid]
        self.x = self.design    # copy to use cookbook ols

        # fill in default names for regressors without a given name
        for ii in range(len(self.x_varnm), self.design.shape[1]):
            self.x_varnm.append('var%02d' % ii)
        #
        #self.yhat = None
        self.estimate()

    def estimate(self):
        pass

    def get_yhat(self):
        pass

    def summary(self):
        pass

    def predict(self, x):
        pass

class Ols(Regression):
    def __init__(self, y, *args, **kwds):
        Regression.__init__(self, y, *args, **kwds)

    def estimate(self):
        (y, x) = self.y, self.design
        #print 'y.shape, x.shape'
        #print y.shape, x.shape
        # lstsq returns solution, sum of squared residuals, rank, singular values
        (self.b, self.resid, rank, self.sigma) = linalg.lstsq(x, y)
        #
        ##            yhat[~mask, :] = np.dot(self.design, self.b)
        ##        else:
        ##            yhat[:,:] = np.dot(self.design, self.b)
        #self.yhat = yhat
        self.yhat = self.predict()
        #print rank

        # estimating coefficients, and basic stats
        self.inv_xx = linalg.pinv(np.dot(self.x.T,self.x)) # use Moore-Penrose pseudoinverse
        xy = np.dot(self.x.T,self.y)                   # estimate coefficients
        #print self.b.shape   # column vector
        self.nobs = self.y.shape[0]                     # number of observations
        self.ncoef = self.x.shape[1]                    # number of coef.
        self.df_e = self.nobs - self.ncoef              # degrees of freedom, error
        self.df_r = self.ncoef - 1                      # degrees of freedom, regression

        self.e = self.y - np.dot(self.x,self.b)            # residuals
        self.sse = np.dot(self.e.T,self.e)/self.df_e         # SSE
        self.se = np.sqrt(np.diagonal(self.sse*self.inv_xx))[:,np.newaxis]  # coef. standard errors
        self.t = self.b / self.se                       # coef. t-statistics
        self.p = (1-stats.t.cdf(np.abs(self.t), self.df_e)) * 2    # coef. p-values

        self.R2 = 1 - self.e.var()/self.y.var()         # model R-squared
        # adjusted R-squared (needed by summary; taken from the cookbook recipe)
        self.R2adj = 1 - (1-self.R2)*((self.nobs-1.)/(self.nobs-self.ncoef))

        self.F = (self.R2/self.df_r) / ((1-self.R2)/self.df_e)  # model F-statistic
        self.Fpv = 1-stats.f.cdf(self.F, self.df_r, self.df_e)  # F-statistic p-value

    def dw(self):
        """
        Calculates the Durbin-Watson statistic
        """
        de = np.diff(self.e, 1, axis=0)
        dw = np.dot(de.T, de) / np.dot(self.e.T, self.e)

        return dw

    def omni(self):
        """
        Omnibus test for normality
        """
        return stats.normaltest(self.e)

    def JB(self):
        """
        Calculate residual skewness, kurtosis, and do the JB test for normality
        """

        # Calculate residual skewness and kurtosis
        skew = stats.skew(self.e)
        kurtosis = 3 + stats.kurtosis(self.e)

        # Calculate the Jarque-Bera test for normality
        # (float constants avoid Python 2 integer division)
        JB = (self.nobs/6.) * (np.square(skew) + 0.25*np.square(kurtosis-3))
        JBpv = 1 - stats.chi2.cdf(JB, 2)

        return JB, JBpv, skew, kurtosis

    def ll(self):
        """
        Calculate model log-likelihood and two information criteria
        """

        # Model log-likelihood, AIC, and BIC criterion values
        ll = -(self.nobs/2.)*(1 + np.log(2*np.pi)) - \
             (self.nobs/2.)*np.log(np.dot(self.e.T, self.e)/self.nobs)
        aic = -2*ll/self.nobs + (2.*self.ncoef/self.nobs)
        bic = -2*ll/self.nobs + (self.ncoef*np.log(self.nobs))/self.nobs

        return ll, aic, bic

    def summary(self):
        """
        Printing model output to screen
        """

        # local time & date
        t = time.localtime()

        # extra stats
        ll, aic, bic = self.ll()
        JB, JBpv, skew, kurtosis = self.JB()
        omni, omnipv = self.omni()

        # printing output to screen
        print '\n=============================================================================='
        print "Dependent Variable: " + self.y_varnm
        print "Method: Ordinary Least Squares"
        print "Date: ", time.strftime("%a, %d %b %Y", t)
        print "Time: ", time.strftime("%H:%M:%S", t)
        print '# obs:           %5.0f' % self.nobs
        print '# variables:     %5.0f' % self.ncoef
        print '=============================================================================='
        print 'variable     coefficient     std. Error    t-statistic       prob.'
        print '=============================================================================='
        for i in range(self.ncoef):
            #print self.x_varnm[i],self.b[i],self.se[i],self.t[i],self.p[i]
            print '''% -5s          % 9.4f     % 9.4f     % 9.4f     % 9.4f''' % tuple(
                [self.x_varnm[i], self.b.ravel()[i], self.se.ravel()[i], self.t.ravel()[i], self.p.ravel()[i]])
        print '=============================================================================='
        print 'Models stats                           Residual stats'
        print '=============================================================================='
        print 'R-squared            % 9.4f         Durbin-Watson stat  % 9.4f' % tuple([self.R2, self.dw()])
        print 'Adjusted R-squared   % 9.4f         Omnibus stat        % 9.4f' % tuple([self.R2adj, omni])
        print 'F-statistic          % 9.4f         Prob(Omnibus stat)  % 9.4f' % tuple([self.F, omnipv])
        print 'Prob (F-statistic)   % 9.4f         JB stat             % 9.4f' % tuple([self.Fpv, JB])
        print 'Log likelihood       % 9.4f         Prob(JB)            % 9.4f' % tuple([ll, JBpv])
        print 'AIC criterion        % 9.4f         Skew                % 9.4f' % tuple([aic, skew])
        print 'BIC criterion        % 9.4f         Kurtosis            % 9.4f' % tuple([bic, kurtosis])
        print '=============================================================================='

    def predict(self, x=None):
        '''return prediction
        todo: add prediction error, confidence interval'''
        if x is None:
            # in-sample prediction on the cleaned design matrix
            yest = np.dot(self.design, self.b)
            if self.mask.any():
                # inputs contained masked values or nans: return a full-length
                # masked array with the dropped observations masked
                # (reconstruction; the original control flow here was garbled)
                yhat = ma.masked_all((self.mask.shape[0], yest.shape[1]))
                yhat[~self.mask, :] = yest
                return yhat
            else:
                return yest
        else:
            # prediction for new data: mask invalid values, add constant column
            x = ma.fix_invalid(x)
            x = ma.column_stack((ma.ones(x.shape[0]), x))
            output = ma.dot(x, self.b)
            #maybe detrend
            output = output.data
            return output

def cookbook_example():
    xxsingular = False  #True
    x = np.linspace(0, 15, 40)
    a, b, c = 3.1, 42, -304.2
    y_true = a*x**2 + b*x + c
    y_meas = y_true + 100.01*np.random.standard_normal(y_true.shape)
    if xxsingular:
        xx = np.c_[x**2, x, 2*x, np.ones(x.shape[0])]
    else:
        xx = np.c_[x**2, x, np.ones(x.shape[0])]

    x_varnm = ['const', 'x1', 'x2', 'x3', 'x4']

    k = xx.shape[1]
    #m = Ols(y_meas,xx,y_varnm = 'y',x_varnm = x_varnm[:k-1],addconst = False)
    m = Ols(y_meas, xx, y_varnm='y', x_varnm=x_varnm[:k], addconst=False)
    m.summary()

    #matplotlib plotting
    plt.title('Linear Regression Example')
    plt.plot(x, y_true, 'g.--')
    plt.plot(x, y_meas, 'k.')
    plt.plot(x, y_meas - m.e.flatten(), 'r.-')
    plt.legend(['original', 'plus noise', 'regression'], loc='lower right')
    np.testing.assert_almost_equal(y_meas[:,np.newaxis] - m.e, m.yhat)
    return m

if __name__ == '__main__':

    import numpy as np
    #from olsexample import ols

    def generate_data(nobs):
        x = np.random.randn(nobs, 2)
        btrue = np.array([[5, 1, 2]]).T
        y = np.dot(x, btrue[1:,:]) + btrue[0,:] + 0.5 * np.random.randn(nobs, 1)
        return y, x

    y, x = generate_data(15)

    #benchmark no masked arrays, and one 2D array for x
    est = Ols(y[1:,:], x[1:,:])   # initialize and estimate with ols, constant added by default
    _est = _ini.Ols(y[1:,:], x[1:,:])  # initialize and estimate with ols, constant added by default
    print 'ols estimate'
    est.estimate()
    print est.b.T
    print _est.b.T
    #print np.array([[5,1,2]])  # true coefficients

    ynew, xnew = generate_data(3)
    ypred = est.predict(xnew)

    print '    ytrue        ypred        error'
    print np.c_[ynew, ypred, ynew - ypred]

    ypred = _est.predict(xnew)
    print np.c_[ynew, ypred, ynew - ypred]

    ym = y.copy()
    ym[0,:] = np.nan
    estm1 = Ols(ym, x)
    _estm1 = _ini.Ols(ym, x)
    print estm1.b.T
    print estm1.yhat.shape
    print _estm1.b.T
    print _estm1.yhat.shape
    print 'yhat'
    print estm1.yhat[:10,:]
    print _estm1.yhat[:10,:]
    assert_almost_equal(estm1.yhat[1:,:], est.yhat)

    #masked array in y (explicit mask instead of nan), one 2D array for x
    # note: the lines defining estm2 appear to be missing from this draft;
    # reconstructed under the assumption that estm2 masks the same
    # observation as estm1, but via an explicit mask
    ymm = ma.array(y, mask=np.zeros(y.shape, dtype=bool))
    ymm[0,:] = ma.masked
    estm2 = Ols(ymm, x)
    _estm2 = _ini.Ols(ymm, x)
    print estm2.b.T
    print _estm2.b.T
    assert_almost_equal(estm2.b, estm1.b)
    assert_almost_equal(estm2.yhat, estm1.yhat)

    #masked y and several x args,
    estm3 = Ols(ym, x[:,0], x[:,1])
    _estm3 = _ini.Ols(ym, x[:,0], x[:,1])
    print estm3.b.T
    print _estm3.b.T
    assert_almost_equal(estm3.b, estm1.b)
    assert_almost_equal(estm3.yhat, estm2.yhat)

    #masked array in y and one x variable
    x_0 = x[:,0].copy()  # is copy necessary?
    x_0[0] = np.nan
    estm4 = Ols(ym, x_0, x[:,1])
    _estm4 = _ini.Ols(ym, x_0, x[:,1])
    print estm4.b.T
    print _estm4.b.T
    assert_almost_equal(estm4.b, estm1.b)
    assert_almost_equal(estm4.yhat, estm1.yhat)

    #masked array in one x variable, but not in y
    x_0 = x[:,0].copy()  # is copy necessary?
    x_0[0] = np.nan
    # note: the lines defining estm5 appear to be missing from this draft;
    # reconstructed from the comment above: nan in one regressor, y unmasked
    estm5 = Ols(y, x_0, x[:,1])
    _estm5 = _ini.Ols(y, x_0, x[:,1])
    print estm5.b.T
    print _estm5.b.T
    assert_almost_equal(estm5.b, estm1.b)
    assert_almost_equal(estm5.yhat, estm2.yhat)
    #assert np.all(estm5.yhat == estm1.yhat)

    #example polynomial
    print 'example with one polynomial x added'