
Ch 11: Remedial measures

Transformation is one of the standard remedial measures for a linear model. Recall that its uses are:

  1. to linearize the regression relationship
  2. to make the error distribution more nearly normal
  3. to make the variances of the error terms more nearly equal.

In this chapter, we cover additional remedial measures that handle several of these pitfalls. Then we discuss nonparametric regression methods, which are quite different from the previous regression models. A common feature of these remedial measures and alternative regression methods is that their estimation procedures are relatively complex compared to the ones we have seen, so we need an easier and more generic way to evaluate the precision of the resulting estimators. Bootstrapping is one such method.

Outline

  1. Other remedial measures

    • unequal error variance - weighted least squares
    • multicollinearity - ridge regression
    • influential cases - robust regression
  2. Nonparametric regression

    • regression trees
  3. Bootstrap confidence intervals


1. Other remedial measures

unequal error variance - weighted least squares

Generalized multiple regression model $Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} + \epsilon_i$, where the $\epsilon_i$'s are independent, $\epsilon_i \sim N(0, \sigma_i^2)$, $i = 1, 2, \cdots, n$. Here, the error variances may not be equal, so that $\sigma^2(\epsilon)$ is an $n \times n$ diagonal matrix with diagonal entries $\sigma_i^2$.

When we use ordinary least squares estimators, we still have these properties:

  1. the estimators of β are unbiased
  2. the estimators are consistent (consistency means that for every $\epsilon > 0$, $P(|b_n - \beta| \le \epsilon) \to 1$ as $n \to \infty$; that is, $b_n$ converges in probability to $\beta$). However, the estimators no longer have minimum variance. This is because, with unequal error variances, the $n$ cases no longer have the same reliability: observations with small variances provide more reliable information about the regression function.

A remedy to handle this problem is to use weighted least squares. First, we start from the simplest case.

when error variances are known Suppose that the error variances $\sigma_i^2$ for all $n$ observations are given. Then we use the maximum likelihood method again. First, we define the likelihood function $L(\beta)$.

$$L(\beta) = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma_i^2)^{1/2}} \exp\left[ -\frac{1}{2\sigma_i^2} \left( Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1} \right)^2 \right]$$

Here, we substitute $w_i = \frac{1}{\sigma_i^2}$ and rewrite the formula.

$$L(\beta) = \prod_{i=1}^{n} \left( \frac{w_i}{2\pi} \right)^{1/2} \exp\left[ -\frac{w_i}{2} \left( Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1} \right)^2 \right]$$

Since the $\sigma_i^2$'s are known, the only parameters of the likelihood function are the regression coefficients. The maximum likelihood method finds the critical point that maximizes the likelihood.

This is the same as minimizing $Q_w = \sum_{i=1}^{n} w_i \left( Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1} \right)^2$. The normal equations can be written using the weight matrix $W = \operatorname{diag}(w_1, w_2, \ldots, w_n)$ as below:

$$X^t W X \, b_w = X^t W Y$$

$b_w = (X^t W X)^{-1} X^t W Y$, where $\sigma^2(b_w) = (X^t W X)^{-1}$. This is called the weighted least squares estimator.
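
As a concrete illustration, here is a minimal numpy sketch of the weighted least squares estimator above; the design matrix, response, and known error variances are made-up example data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example data: n observations, intercept plus one predictor.
n = 50
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept
sigma2 = (1 + 0.5 * x) ** 2                   # known, unequal error variances
Y = 2 + 3 * x + rng.normal(0, np.sqrt(sigma2))

# Weighted least squares: b_w = (X' W X)^{-1} X' W Y with w_i = 1 / sigma_i^2.
W = np.diag(1.0 / sigma2)
XtWX = X.T @ W @ X
b_w = np.linalg.solve(XtWX, X.T @ W @ Y)

# Variance-covariance matrix of b_w is (X' W X)^{-1}.
cov_b_w = np.linalg.inv(XtWX)
print(b_w)
print(cov_b_w)
```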

when variances are known up to proportionality constant Now, relaxing the condition, assume that we only know the error variances up to proportion. For example, the second error variance may be twice the first one, i.e. $\sigma_2^2 = 2\sigma_1^2$. In general, write $\sigma_i^2 = k_i \sigma^2$, where the $k_i$ are known constants and $\sigma^2$ is an unknown baseline error variance, so that $w_i = \frac{1}{k_i \sigma^2}$. Since the common factor $1/\sigma^2$ cancels in the normal equations, we can use the weights $w_i = 1/k_i$ with $W = \operatorname{diag}(1/k_1, 1/k_2, \ldots, 1/k_n)$. Then the $p \times p$ matrix $\sigma^2(b_w) = \sigma^2 (X^t W X)^{-1}$. Since $\sigma^2$ is unknown, we use $s^2(b_w) = MSE_w (X^t W X)^{-1}$ instead, where $MSE_w = \frac{\sum_{i=1}^{n} w_i (Y_i - \hat{Y_i})^2}{n-p} = \frac{\sum_{i=1}^{n} w_i e_i^2}{n-p}$.

when error variances are unknown We use estimators of the error variances. There are two approaches.

  1. Estimation by residuals: $\sigma_i^2 = E(\epsilon_i^2) - [E(\epsilon_i)]^2 = E(\epsilon_i^2)$. Hence, the squared residual $e_i^2$ is an estimator of $\sigma_i^2$ (a short sketch of this approach follows the list below).
  2. Use of replicates or near replicates
    • In a designed experiment (such as a scientific simulation), replicate observations can be made. If the number of replicates is large, we use the fact that the sample variance approaches the true variance.
    • However, in observational studies (such as social studies), we use nearby observations as replicates.
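
As referenced in approach 1, here is a minimal sketch of estimating the error variances from residuals and refitting by weighted least squares; smoothing the squared residuals by regressing them on the predictor is one common variant and, along with the made-up data, is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up heteroscedastic data.
n = 100
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
Y = 1 + 2 * x + rng.normal(0, 0.5 + 0.4 * x)

# Step 1: ordinary least squares fit and its residuals.
b_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
e = Y - X @ b_ols

# Step 2: estimate sigma_i^2 by regressing the squared residuals on x
# (smoothing e_i^2 rather than using them raw reduces noise in the weights).
gamma = np.linalg.lstsq(X, e**2, rcond=None)[0]
sigma2_hat = np.clip(X @ gamma, 1e-6, None)   # keep variance estimates positive

# Step 3: weighted least squares with estimated weights w_i = 1 / sigma_i^2.
W = np.diag(1.0 / sigma2_hat)
b_w = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
print(b_ols, b_w)
```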

When the weights are estimated, inference procedures are only approximate. We still want to estimate the variance-covariance matrix $\sigma^2(b_w)$; confidence intervals for the regression coefficients are based on $s^2(b_{wk})$, and the bootstrap method can then be applied to check their precision.

Lastly, we can simply use ordinary least squares, leaving the error variances unequal. The regression coefficient estimator $b$ is still unbiased and consistent, but it is no longer a minimum variance estimator. Its variance is given by $\sigma^2(b) = (X^t X)^{-1} X^t \sigma^2(\epsilon) X (X^t X)^{-1}$ and is estimated by $s^2(b) = (X^t X)^{-1} X^t S_0 X (X^t X)^{-1}$, where $S_0 = \operatorname{diag}(e_1^2, e_2^2, \ldots, e_n^2)$.
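
This robust variance estimate can be computed directly from the formula above; below is a minimal numpy sketch with made-up heteroscedastic data.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 100
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
Y = 1 + 2 * x + rng.normal(0, 0.5 + 0.4 * x)   # unequal error variances

# Ordinary least squares estimate and residuals.
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
e = Y - X @ b

# s^2(b) = (X'X)^{-1} X' S0 X (X'X)^{-1}, with S0 = diag(e_1^2, ..., e_n^2).
S0 = np.diag(e**2)
s2_b = XtX_inv @ X.T @ S0 @ X @ XtX_inv
print(np.sqrt(np.diag(s2_b)))   # robust standard errors of b0, b1
```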

Remark on weighted least squares Weighted least squares can be interpreted as transforming the data $(X_i, Y_i, \epsilon_i)$ by $W^{1/2}$. This can be written as

$$W^{1/2} Y = W^{1/2} X \beta + W^{1/2} \epsilon$$

and by setting $Y_w = W^{1/2} Y$, $X_w = W^{1/2} X$, and $\epsilon_w = W^{1/2} \epsilon$, the transformed variables follow the ordinary linear regression model

$$Y_w = X_w \beta + \epsilon_w$$

where $E(\epsilon_w) = W^{1/2} E(\epsilon) = W^{1/2} \, 0 = 0$ and $\sigma^2(\epsilon_w) = W^{1/2} \sigma^2(\epsilon) W^{1/2} = W^{1/2} W^{-1} W^{1/2} = I$. The regression coefficients are given by $b_w = (X_w^t X_w)^{-1} X_w^t Y_w$.
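
This equivalence is easy to check numerically: transforming the data by $W^{1/2}$ and running ordinary least squares reproduces the weighted least squares coefficients. A small sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 50
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
sigma2 = (1 + 0.3 * x) ** 2
Y = 2 + 3 * x + rng.normal(0, np.sqrt(sigma2))

w = 1.0 / sigma2
W = np.diag(w)

# Weighted least squares directly from the normal equations.
b_w = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

# Equivalent: transform by W^{1/2}, then ordinary least squares.
W_half = np.diag(np.sqrt(w))
Xw, Yw = W_half @ X, W_half @ Y
b_transformed = np.linalg.lstsq(Xw, Yw, rcond=None)[0]

print(np.allclose(b_w, b_transformed))   # True: the two fits coincide
```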

multicollinearity - ridge regression

When multicollinearity is detected in a regression model, we have fixed the model by

  1. using centered data in polynomial regression models,
  2. dropping some redundant predictor variables,
  3. adding observations that could break the multicollinearity,
  4. using principal components instead of the current variables $X_k$, or
  5. using ridge regression.

Ridge regression perturbs the estimated coefficient $b_R$ away from the unbiased estimator $b$, and thereby it can remedy the multicollinearity problem. In other words, the unbiased estimator $b$ is obtained from the observed $X$ with multicollinearity, while the slightly perturbed estimator $b_R$ behaves as if it were obtained from a virtual perturbed observation $X_R$ without multicollinearity. When perturbing, we add the restriction that $b_R$ has a small norm (length). This length restriction shrinks the variance of the sampling distribution of $b_R$. Although $b_R$ is no longer an unbiased estimator, a sampled $b_R$ is more likely to be close to $E(b_R)$ than a sampled $b$ is to $E(b) = \beta$. The relationship is shown in the figure.

[Figure: ridge regression]

To obtain $b_R$, we work with correlation-transformed $X$ observations. This transformation standardizes the variables, so the penalty term and $SSE = (Y - X\beta)^t (Y - X\beta)$ are on comparable scales. $b_R$ is obtained by minimizing a modified error sum of squares $Q(\beta) = (Y - X\beta)^t (Y - X\beta) + c\,\beta^t \beta$, where $c \ge 0$ is a given biasing constant. The squared $\ell_2$ norm of $\beta$ is added to the function, so we minimize both $SSE$ and the length of $\beta$ at the same time. How strongly the length of $\beta$ is penalized is adjusted by choosing $c$: the larger the value of $c$, the more the minimization focuses on shrinking the length of $\beta$. The resulting normal equations are $(X^t X + cI) b_R = X^t Y$, therefore $b_R = (X^t X + cI)^{-1} X^t Y$.
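
As a concrete illustration, here is a minimal numpy sketch of the ridge estimator on correlation-transformed variables; the made-up collinear data and the specific standardization function are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up data with two highly correlated predictors (multicollinearity).
n = 60
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.05, n)               # nearly collinear with x1
Y = 3 + 2 * x1 + x2 + rng.normal(0, 1, n)

# Correlation transformation: standardize each variable by sqrt(n-1) * s.
def corr_transform(v):
    return (v - v.mean()) / (np.sqrt(len(v) - 1) * v.std(ddof=1))

Yc = corr_transform(Y)
Xc = np.column_stack([corr_transform(x1), corr_transform(x2)])

# Ridge estimator on the transformed variables: b_R = (X'X + cI)^{-1} X'Y.
c = 0.1
bR = np.linalg.solve(Xc.T @ Xc + c * np.eye(Xc.shape[1]), Xc.T @ Yc)
print(bR)
```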

Choice of biasing constant c As the biasing constant $c$ increases, the bias of the estimator increases. But there exists a value of $c$ for which the ridge estimator $b_R$ has a smaller total mean squared error $E\{(b_R - \beta)^t (b_R - \beta)\}$ than the ordinary least squares estimator $b$. The optimum value of $c$ varies from one application to another and is unknown. Common heuristics for determining an appropriate $c$ are:

  1. using the ridge trace: Try several values of $c$ in $0 \le c \le 1$ and plot the components of $b_R$ against $c$. Take the point $c$ where the fluctuation in $b_R$ ceases.
[Figure: ridge trace]
  2. using $VIF_k$: Try several values of $c$ in $0 \le c \le 1$ and plot the $VIF_k$'s against $c$. Take the smallest $c$ for which the $VIF_k$'s become sufficiently small (close to 1).
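
Both heuristics can be tabulated directly. Below is a rough sketch that computes the ridge coefficients and ridge $VIF_k$ values over a grid of $c$ for made-up collinear data; the ridge VIF expression used here, the diagonal of $(r_{XX}+cI)^{-1} r_{XX} (r_{XX}+cI)^{-1}$, and the grid of $c$ values are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up collinear data, standardized via the correlation transformation.
n = 60
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.05, n)
Y = 3 + 2 * x1 + x2 + rng.normal(0, 1, n)

def corr_transform(v):
    return (v - v.mean()) / (np.sqrt(len(v) - 1) * v.std(ddof=1))

Yc = corr_transform(Y)
Xc = np.column_stack([corr_transform(x1), corr_transform(x2)])
XtX, XtY = Xc.T @ Xc, Xc.T @ Yc        # XtX is the correlation matrix r_XX

# Ridge trace: coefficients and VIFs as functions of the biasing constant c.
for c in [0.0, 0.002, 0.01, 0.05, 0.1, 0.5, 1.0]:
    A_inv = np.linalg.inv(XtX + c * np.eye(2))
    bR = A_inv @ XtY
    vif = np.diag(A_inv @ XtX @ A_inv)  # ridge VIF_k (assumed formula)
    print(f"c={c:5.3f}  bR={bR}  VIF={vif}")
```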

influential cases - robust regression

When an outlying influential case is not clearly erroneous, we should carry out a further examination, which can provide important clues for improving the model. We may find

  • the omission of an important predictor variable,
  • incorrect functional forms, or
  • needs to dampen the influence of such cases.

Robust regression is an approach that makes the results rely less on outlying cases. There are several ways to conduct robust regression:

  1. LAR (= LAD = minimum $\ell_1$-error) regression: Least Absolute Residuals (= Least Absolute Deviations = minimum $\ell_1$-error) regression is done by minimizing $Q_1(\beta) = \sum_{i=1}^{n} |Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1}|$. When there is an outlier whose residual is large ($|e_i| > 1$), its squared residual becomes much larger, so a least squares fit can overweight this observation. This is alleviated by using the absolute values of the residuals. Note that the $\ell_1$ criterion is not differentiable, so we cannot obtain the estimator by setting partial derivatives to zero; instead, the problem is solved by linear programming (a kind of optimization technique).
  2. IRLS robust regression: Iteratively Reweighted Least Squares. This approach uses the weighted least squares procedure introduced before, but the weights are now determined by how far outlying each case is: the more outlying a case, the smaller its weight. The weights are recomputed from the residuals at each iteration until the fit converges (see the sketch after this list).
  3. LMS regression: Least Median of Squares. Recall that the mean is sensitive to outliers, but the median is not. Thus, we minimize $\operatorname{median}\{(Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1})^2\}$.
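
As referenced in item 2, here is a minimal sketch of IRLS robust regression; the Huber weight function with tuning constant 1.345 and the MAD scale estimate are common choices, and the data are made up.

```python
import numpy as np

rng = np.random.default_rng(6)

# Made-up data with a few gross outliers in the response.
n = 50
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
Y = 1 + 2 * x + rng.normal(0, 1, n)
Y[:3] += 30                                   # three outlying cases

def huber_weights(u, k=1.345):
    """Huber weight function applied to scaled residuals u."""
    au = np.abs(u)
    return np.where(au <= k, 1.0, k / np.maximum(au, k))

# Start from the ordinary least squares fit, then iteratively reweight.
b = np.linalg.lstsq(X, Y, rcond=None)[0]
for _ in range(20):
    e = Y - X @ b
    scale = np.median(np.abs(e - np.median(e))) / 0.6745   # MAD scale estimate
    w = huber_weights(e / scale)
    W = np.diag(w)
    b_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
    if np.allclose(b_new, b, atol=1e-8):
        break
    b = b_new

print(b)   # closer to (1, 2) than the OLS fit pulled up by the outliers
```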

2. Nonparametric regression

For complex cases, it is hard to guess the right functional form (analytic expression) of the regression function. Also, as the number of parameters increases, the space of observations becomes sparse (there are fewer cases in any neighborhood), so the fitted regression function tends to behave erratically. For these cases, nonparametric regression can be helpful.

regression trees

A regression tree is a powerful yet computationally simple method of nonparametric regression. It requires virtually no assumptions, and it is simple to interpret. It is popular for exploratory studies, especially with large data sets.

[Figure: regression tree]

At each branch of a regression tree, the range of a specific predictor variable $X_k$ is partitioned into two segments (one for $X_k \ge C$ and the other for $X_k < C$). At the terminal nodes of the tree, the estimated regression fit is the mean response in the corresponding segment, i.e. $\hat{Y_i} = \frac{1}{n_s} \sum_{j=1}^{n_s} Y_j$, where the $Y_j$'s belong to the segment containing the $i$th observation and $n_s$ is the number of observations in that segment.

Finding a “best” regression tree comes down to choosing

  • the number of segmented regions r, and
  • the split points $X_k = C$ that optimally divide the data into two sets at each branch.

The basic idea is to choose the model that minimizes the error sum of squares of the resulting regression tree: $SSE = \sum_{t_i \in T} SSE(t_i)$, where $T$ is the set of all terminal nodes and $SSE(t_i) = \sum_{j \in t_i} (Y_j - \bar{Y}_{t_i})^2$. Here, $MSE = SSE/(n-r)$ and $R^2 = 1 - \frac{SSE}{SSTO}$. The number of candidate models is $r(p-1)$, where $1 \le r \le n$. The value of $r$ is usually chosen through validation studies as $\arg\min_r MSPR$.
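
To make the split criterion concrete, here is a minimal sketch that finds the single best split point $C$ for one predictor by minimizing the resulting $SSE$; a full regression tree repeats this search recursively on each segment. The data are made up and no pruning or validation is included.

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up one-predictor data with a mean shift at X = 5.
n = 200
x = rng.uniform(0, 10, n)
Y = np.where(x < 5, 2.0, 7.0) + rng.normal(0, 1, n)

def best_split(x, Y):
    """Return the split point C minimizing the SSE around the two segment means."""
    best_C, best_sse = None, np.inf
    for C in np.unique(x)[1:]:                 # candidate split points
        left, right = Y[x < C], Y[x >= C]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best_sse:
            best_C, best_sse = C, sse
    return best_C, best_sse

C, sse = best_split(x, Y)
print(C, sse)   # C should be close to 5; the fitted value in each segment is its mean
```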

3. Bootstrap confidence intervals

The regression approaches above are complex, so it is hard to derive confidence intervals from the earlier analysis of $\sigma^2(b)$ or $s^2(b)$. Instead, we can approximate confidence intervals by the bootstrap method. There are many different procedures for obtaining bootstrap confidence intervals; one example is the reflection method.

A $1-\alpha$ confidence interval for a parameter $\beta$ is $b - (b^*(1-\alpha/2) - b) \le \beta \le b + (b - b^*(\alpha/2))$, where $b^*(1-\alpha/2)$ and $b^*(\alpha/2)$ are the $(1-\alpha/2)$ and $\alpha/2$ percentiles of the bootstrap distribution of $b^*$.
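
A minimal sketch of the reflection method for one regression coefficient, resampling cases $(X_i, Y_i)$ with replacement; the data, the number of bootstrap samples, and the choice to bootstrap cases (rather than residuals) are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(8)

# Made-up data and the original least squares slope b1.
n = 80
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
Y = 1 + 2 * x + rng.normal(0, 2, n)
b1 = np.linalg.lstsq(X, Y, rcond=None)[0][1]

# Bootstrap the cases: resample (X_i, Y_i) pairs with replacement and refit.
B = 2000
b1_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)
    b1_star[b] = np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0][1]

# Reflection method: [b1 - (b1*(1-a/2) - b1), b1 + (b1 - b1*(a/2))].
alpha = 0.05
lo, hi = np.percentile(b1_star, [100 * alpha / 2, 100 * (1 - alpha / 2)])
ci = (b1 - (hi - b1), b1 + (b1 - lo))
print(ci)
```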



