RA ch9 Model selection and validation

Ch 9: Model selection and validation

When we have $p-1$ predictor variables, we can construct $2^{p-1}$ different linear models by choosing which predictors to include. A complex model can reduce the error sum of squares SSE significantly, but it may overfit and adapt poorly to unseen cases. Therefore, when we choose a model, we need to consider the 2 conditions below at the same time.

  • a small error, which can be measured by SSE
  • simple enough to perform well with new data
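As a small illustration of the $2^{p-1}$ count, the sketch below enumerates every subset of predictors. The predictor names are hypothetical, and the intercept is assumed to be in every model.

```python
from itertools import combinations

def candidate_models(predictors):
    """Yield every subset of the predictors (the intercept is assumed to always be included)."""
    for k in range(len(predictors) + 1):
        for subset in combinations(predictors, k):
            yield subset

# Hypothetical predictor names; here p - 1 = 3.
predictors = ["X1", "X2", "X3"]
models = list(candidate_models(predictors))

# Number of candidate models, 2 ** (p - 1).
print(len(models))
```

The count doubles with every extra predictor, which is why an exhaustive search quickly becomes impractical.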

This chapter introduces several criteria for choosing a good linear regression model. Some can be used (or have been justified) only in linear regression, while others apply to general models.

Outline

  1. Criteria for model selection
    • coefficient of multiple determination, $R_p^2$
    • adjusted coefficient of multiple determination, $R_{a,p}^2$
    • Mallows' $C_p$
    • $AIC_p$ and $BIC_p$
    • Prediction sum of squares, $PRESS_p$
  2. Model validation

1. Criteria for model selection

The criteria for model selection below favor a model with

  • small SSE
  • small number of parameters p.

coefficient of multiple determination, $R_p^2$

$$R_p^2 = 1 - \frac{SSE_p}{SSTO}$$

Larger is better. It can be used in general models. Drawback: $R_p^2$ never decreases when a predictor is added, so it always selects the most complex model.

adjusted coefficient of multiple determination, $R_{a,p}^2$

$$R_{a,p}^2 = 1 - \frac{SSE_p/(n-p)}{SSTO/(n-1)}$$

This is a version of $R^2$ adjusted for the number of parameters $p$. Larger is better. It can be used in general models.
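The following sketch compares the two criteria on made-up numbers (the SSE, SSTO, $n$ and $p$ values are illustrative assumptions, not results from any dataset). It shows a case where the larger model wins on $R_p^2$ but loses on $R_{a,p}^2$.

```python
def r_squared(sse_p, ssto):
    """Coefficient of multiple determination: 1 - SSE_p / SSTO."""
    return 1.0 - sse_p / ssto

def adjusted_r_squared(sse_p, ssto, n, p):
    """Adjusted version: 1 - (SSE_p / (n - p)) / (SSTO / (n - 1))."""
    return 1.0 - (sse_p / (n - p)) / (ssto / (n - 1))

# Illustrative values only: the larger model barely lowers SSE.
n, ssto = 30, 200.0
sse_small, p_small = 60.0, 3
sse_large, p_large = 59.5, 6

r2_small = r_squared(sse_small, ssto)
r2_large = r_squared(sse_large, ssto)
adj_small = adjusted_r_squared(sse_small, ssto, n, p_small)
adj_large = adjusted_r_squared(sse_large, ssto, n, p_large)

print(r2_small, r2_large)    # plain R^2 prefers the larger model
print(adj_small, adj_large)  # adjusted R^2 prefers the smaller model
```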

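Mallows' $C_p$

Consider the total squared error of the fitted values about the true mean responses, $\sum_i (\hat{Y}_i - E(Y_i))^2$. Its expectation can be divided into two components.

  • $\sum_i E(\hat{Y}_i - E(\hat{Y}_i))^2 = \sum_i \sigma^2(\hat{Y}_i)$ : random error caused by random noise (cannot be handled by the model itself)
  • $\sum_i (E(\hat{Y}_i) - E(Y_i))^2$ : bias of the model

The criterion measure is this expected total, scaled by $\sigma^2$:

$$\Gamma_p = \frac{1}{\sigma^2} E\left[\sum_{i=1}^n (\hat{Y}_i - E(Y_i))^2\right] = \frac{1}{\sigma^2}\left[\sum_i \sigma^2(\hat{Y}_i) + \sum_i (E(\hat{Y}_i) - E(Y_i))^2\right].$$

But there are some problems:

  1. We do not know $\sigma^2$ in general.
  2. We cannot obtain $E(Y_i)$.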

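Let the full model have $P-1$ predictor variables, and let each model that we want to assess have $p-1$ predictor variables.

  • To settle problem 1, we use $MSE_{full} = MSE(X_1, X_2, \dots, X_{P-1})$ (MSE of the full model) rather than $\sigma^2$.
  • To resolve problem 2, we use the formula $E(SSE_p) = \sum_i (E(\hat{Y}_i) - E(Y_i))^2 + (n-p)\sigma^2$. (You may prove this using the fact that $\mathrm{Tr}(\mathrm{Var}(\hat{Y})) = \mathrm{Tr}(\sigma^2 H) = p\sigma^2$.)
    Therefore, $\Gamma_p = \frac{1}{\sigma^2}\left[E(SSE_p) - (n-2p)\sigma^2\right]$.

    In detail, $\Gamma_p \sigma^2 = \sum_i (E(\hat{Y}_i) - E(Y_i))^2 + (n-p)\sigma^2 - (n-p)\sigma^2 + \sum_i \sigma^2(\hat{Y}_i)$. Since $\sum_i \sigma^2(\hat{Y}_i) = p\sigma^2$, we get $\Gamma_p \sigma^2 = E(SSE_p) - (n-2p)\sigma^2$.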

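Now, replacing $E(SSE_p)$ with $SSE_p$ and $\sigma^2$ with $MSE_{full}$ in $\Gamma_p = \frac{1}{\sigma^2}\left[E(SSE_p) - (n-2p)\sigma^2\right]$, we obtain the estimator

$$C_p = \frac{SSE_p}{MSE_{full}} - (n-2p).$$

Note that when there is no bias in the model with $p-1$ predictor variables, $\Gamma_p = p$:

$$\Gamma_p = \frac{1}{\sigma^2}\sum_i \sigma^2(\hat{Y}_i) = \frac{1}{\sigma^2}\, p\sigma^2 = p.$$

By the way, $C_p$ (an estimator of $\Gamma_p$) equals $P$ for the full model, because then $p = P$ and $MSE_{full} = SSE_P/(n-P)$:

$$C_P = \frac{SSE_P}{MSE_{full}} - (n-2P) = (n-P) - (n-2P) = P.$$

Therefore, a small $C_p$ close to $p$ is better. The $C_p$ formula is specific to assessing regression models. For other models, you should analyze $\Gamma_p$ for their situation.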
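A minimal sketch of the $C_p$ computation, assuming $SSE_p$ and the full-model MSE have already been obtained. The numbers are made up for illustration; the check at the end confirms that the full model always gets $C_p$ equal to its own number of parameters.

```python
def mallows_cp(sse_p, mse_full, n, p):
    """C_p = SSE_p / MSE_full - (n - 2p)."""
    return sse_p / mse_full - (n - 2 * p)

# Illustrative values: n observations, full model with P parameters.
n, P = 30, 6
sse_full = 48.0
mse_full = sse_full / (n - P)  # 48 / 24 = 2.0

cp_full = mallows_cp(sse_full, mse_full, n, P)  # equals P by construction
cp_p3 = mallows_cp(60.0, mse_full, n, 3)        # 30 - 24 = 6, well above p = 3
cp_p4 = mallows_cp(53.0, mse_full, n, 4)        # 26.5 - 22 = 4.5, close to p = 4

print(cp_full, cp_p3, cp_p4)
```

In this toy comparison the model with $p = 4$ would be preferred over the one with $p = 3$, since its $C_p$ is both smaller and closer to $p$.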

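$AIC_p$ and $BIC_p$

$$AIC_p = n\log(SSE_p) - n\log(n) + 2p$$

$$BIC_p = n\log(SSE_p) - n\log(n) + [\log(n)]\,p$$

where $\log$ is the natural logarithm and the brackets in $[\log(n)]\,p$ are only grouping: the penalty is $\log(n)$ times $p$, with no rounding.

Smaller is better. Both are simple criteria to compute.

Note that $BIC_p$ is more sensitive to $p$ (it penalizes large $p$ more heavily). To be exact, if $n \ge 8$, then $\log(n) > 2$ and so $AIC_p \le BIC_p$.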
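The two criteria can be sketched as below, using the usual $\log(n) \cdot p$ penalty for BIC with the natural logarithm. The input values are made up for illustration; the point is that the two criteria differ only by $(\log(n) - 2)\,p$, which changes sign between $n = 7$ and $n = 8$.

```python
import math

def aic(sse_p, n, p):
    """AIC_p = n log(SSE_p) - n log(n) + 2p (natural logarithm)."""
    return n * math.log(sse_p) - n * math.log(n) + 2 * p

def bic(sse_p, n, p):
    """BIC_p = n log(SSE_p) - n log(n) + log(n) * p (natural logarithm)."""
    return n * math.log(sse_p) - n * math.log(n) + math.log(n) * p

# Illustrative values only.
gap_n30 = bic(60.0, 30, 3) - aic(60.0, 30, 3)  # (log(30) - 2) * 3 > 0
gap_n8 = bic(10.0, 8, 3) - aic(10.0, 8, 3)     # (log(8) - 2) * 3 > 0
gap_n7 = bic(10.0, 7, 3) - aic(10.0, 7, 3)     # (log(7) - 2) * 3 < 0

print(gap_n30, gap_n8, gap_n7)
```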

Prediction sum of squares, $PRESS_p$

$PRESS_p$ doesn't explicitly penalize a model with many predictor variables. Instead, it directly calculates the prediction error on unseen data:

$$PRESS_p = \sum_i (Y_i - \hat{Y}_{i(i)})^2$$

where $\hat{Y}_{i(i)}$ denotes the $i$th fitted value from the regression fitted without the observation $(X_i, Y_i)$.

Smaller is better. It can be used in general models. However, if the number of predictor variables is large, the computation is heavy. Therefore, we don't go through all the candidate models, but use a stepwise selection method.
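The definition of $PRESS_p$ can be sketched literally as $n$ separate refits, each leaving one observation out. This is a minimal NumPy sketch on simulated data (the data-generating model and the helper name `press` are assumptions for illustration), not an efficient implementation.

```python
import numpy as np

def press(X, y):
    """PRESS by explicit leave-one-out refits; X must already contain the intercept column."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i                       # drop observation i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += (y[i] - X[i] @ beta) ** 2             # squared deleted residual
    return float(total)

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n = 40
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

beta_all, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = float(np.sum((y - X @ beta_all) ** 2))
press_value = press(X, y)

print(sse, press_value)  # PRESS is larger than SSE for a least-squares fit
```

Each call fits $n$ regressions, which is where the heavy computation comes from when many candidate models have to be compared.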

stepwise selection method

The stepwise selection method is a greedy approach to solving an optimization problem. At each stage, it calculates the criterion function values of all the states that can be reached in one step, and it moves to the best of those states.

  • It doesn't guarantee obtaining the global optimum.
  • But it is the easiest approach to optimization problems, both computationally and in terms of designing the algorithm.
  • It significantly reduces the computational load.
  • And it gives a quite reasonable solution in a limited time.

Here, we could use the forward, backward, or bidirectional stepwise method, each of which has a computational complexity of $O(p^2)$ model fits.

Ex) forward method: We start from the one-predictor model whose predictor has the largest absolute t-value $t = b_k / s(b_k)$. Then we add the variable that gives the smallest $PRESS_p$ (smaller is better). We repeat this adding step until there is no significant change.
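A forward search of this kind can be sketched as follows. Two simplifications are assumed here and are not from the text: the search starts from the intercept-only model and uses $PRESS_p$ for every step (instead of picking the first variable by t-value), and "no significant change" is taken to mean a relative improvement below a hypothetical threshold `tol`. The data are simulated for illustration.

```python
import numpy as np

def press(X, y):
    """PRESS by explicit leave-one-out refits; X must already contain the intercept column."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += (y[i] - X[i] @ beta) ** 2
    return float(total)

def forward_select(X, y, tol=0.01):
    """Greedy forward selection on PRESS. X holds the predictors only (no intercept column)."""
    n, k = X.shape
    ones = np.ones((n, 1))
    selected = []
    best = press(ones, y)                      # intercept-only model
    while len(selected) < k:
        # Evaluate every model reachable by adding one more predictor.
        scores = {}
        for j in range(k):
            if j not in selected:
                cols = selected + [j]
                scores[j] = press(np.hstack([ones, X[:, cols]]), y)
        j_best = min(scores, key=scores.get)   # smallest PRESS wins
        if best - scores[j_best] <= tol * best:
            break                              # no significant improvement
        selected.append(j_best)
        best = scores[j_best]
    return selected, best

# Simulated data: only columns 0 and 3 actually affect y.
rng = np.random.default_rng(1)
n = 60
X = rng.normal(size=(n, 5))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=n)

selected, best_press = forward_select(X, y)
print(selected, best_press)
```

With $k$ candidate predictors, the loop evaluates at most $k + (k-1) + \dots + 1$ models, which matches the quadratic count mentioned above, compared with $2^k$ for an exhaustive search.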

Remark: Generally, computing $PRESS_p$ is costly. However, when it comes to assessing a linear regression model, there is a convenient formula for the deleted residual $Y_i - \hat{Y}_{i(i)}$. It will be introduced in the next chapter.

2. Model validation

To find a model that performs well in prediction, we check a candidate model against independent data that is unseen and uncorrelated with the training data, but comes from the same population. The steps are

  1. divide the given dataset into 2 non-overlapping sets
    training data : validation data = 7:3 or 6:4
  2. fit a model using the training data $(X_t, Y_t)$ and obtain the parameters $\hat{\beta}_t = (X_t^\top X_t)^{-1} X_t^\top Y_t$
  3. calculate the mean squared prediction error

$$MSPR = \frac{\sum_{i=1}^{n^*} (Y_{i,new} - \hat{Y}_i)^2}{n^*}$$

where $n^*$ = number of validation data points, and $\hat{Y}_i = X_{i,new}^\top \hat{\beta}_t$. For a properly trained model, MSPR is typically somewhat larger than the training MSE (MSPR > MSE). But if $MSPR \gg MSE$, the model is invalid (it lacks predictive ability).
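The three steps can be sketched with NumPy as below. The data are simulated and the 7:3 split is done by a random permutation; both are assumptions for illustration.

```python
import numpy as np

# Simulated data, for illustration only: intercept plus two predictors.
rng = np.random.default_rng(0)
n_total = 200
X = np.column_stack([np.ones(n_total), rng.normal(size=(n_total, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=1.0, size=n_total)

# Step 1: split into non-overlapping training and validation sets (7:3).
perm = rng.permutation(n_total)
n_train = (7 * n_total) // 10
train, valid = perm[:n_train], perm[n_train:]

# Step 2: least-squares fit on the training data via the normal equations.
Xt, yt = X[train], y[train]
beta_t = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)

# Training MSE = SSE / (n - p), for comparison with MSPR.
p = X.shape[1]
mse = float(np.sum((yt - Xt @ beta_t) ** 2) / (n_train - p))

# Step 3: mean squared prediction error on the validation data.
mspr = float(np.mean((y[valid] - X[valid] @ beta_t) ** 2))

print(mse, mspr)  # the two should be of similar size for a valid model
```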


Here is the Jupyter notebook with several practice examples in R.



