If you need help setting up your computer to handle numerical computations and display interactive results, this 15-minute video will give you the first step.
View slides | Jupyter notebook | Youtube video
This 18-minute video provides a tutorial on “Implementation and Python Library Tutorial for PINNs to Handle Dynamical Systems”.
The Lorenz model and its inverse/forward problems are chosen as the example dynamical system.
I attached the modified example code for the forward/inverse Lorenz model as Jupyter notebooks (ipynb) along with the slides.
View slides | Jupyter notebook - pytorch PINN | Jupyter notebook - DeepXDE | Youtube video
In the previous chapters, the errors \(\epsilon_i\) are assumed to be
However, in business and economics, time series data often fail to satisfy this assumption. In time series data, error terms are likely to be autocorrelated / serially correlated over time. This is mainly because
Autocorrelated error terms, positively autocorrelated ones in particular, may trigger problems such as:
Therefore, we need to analyze a regression model with autoregressive errors, test if a model is autocorrelated and then remedy autocorrelation.
The generalized simple linear regression model for one predictor variable when the random error terms follow a first order autoregressive, or AR(1), process is
1: \(Y_t = \beta_0 + \beta_1 X_t + \epsilon_t\)
2: \(\epsilon_t = \rho \epsilon_{t-1} + u_t\)
where \(\rho\) is a parameter such that \(|\rho| <1\) and \(u_t\) are independent \(N(0, \sigma^2)\). \(\rho\) is called the autocorrelation parameter.
The one-variable case above can be generalized to multiple variables. The only change is the first line, which becomes 1’: \(Y_t = \beta_0 + \beta_1 X_{t1} + \cdots + \beta_{p-1} X_{t,p-1} +\epsilon_t\).
Remark For a second-order, or AR(2), process, line 2 becomes 2’: \(\epsilon_t = \rho_1 \epsilon_{t-1} + \rho_2 \epsilon_{t-2} + u_t\).
Proof Note that \(\epsilon_t = \rho \epsilon_{t-1} + u_t = \rho^2 \epsilon_{t-2} + \rho u_{t-1} + u_t\) \(= \rho^3 \epsilon_{t-3} + \rho^2 u_{t-2} + \rho u_{t-1} + u_t = \cdots =\) \(\sum_{s=0}^\infty \rho^s u_{t-s}\).
(2) Since \(u_t\)’s are uncorrelated to each other and \(\sigma^2(u_t) = \sigma^2\), \(\sigma^2(\epsilon_t) = \sum_{s=0}^\infty \rho^{2s} \sigma^2(u_{t-s}) = \sigma^2 \sum_{s=0}^\infty \rho^{2s} = \sigma^2 \frac{1}{1-\rho^2}\) (\(|\rho|<1\)).
(3) Since \(E(\epsilon_t) = 0\), \(\sigma(\epsilon_t, \epsilon_{t-1}) = E(\epsilon_t \epsilon_{t-1})\) \(=E[(u_t + \rho u_{t-1} + \rho^2 u_{t-2} + \cdots )(u_{t-1} + \rho u_{t-2} + \rho^2 u_{t-3} + \cdots ) ]\) \(=E[(u_t + \rho (u_{t-1} + \rho u_{t-2} + \cdots ))(u_{t-1} + \rho u_{t-2} + \rho^2 u_{t-3} + \cdots ) ]\) \(= E[u_t (u_{t-1} + \rho u_{t-2} + \cdots )] + \rho E [ u_{t-1} + \rho u_{t-2} + \cdots ]^2\). By independence of the \(u_t\)'s, \(E(u_t u_{t-s}) = E(u_t) E(u_{t-s}) = 0 ~\forall s \neq 0\). \(\Rightarrow \sigma(\epsilon_t, \epsilon_{t-1}) = \rho E [ u_{t-1} + \rho u_{t-2} + \cdots ]^2 = \rho \sigma^2(\epsilon_{t-1}) = \rho \frac{\sigma^2}{1-\rho^2}\).
The \(n \times n\) variance-covariance matrix of the errors is

\[\sigma^2(\epsilon) = \begin{bmatrix} \kappa & \kappa \rho & \kappa \rho^2 & \cdots & \kappa \rho^{n-1} \\ \kappa \rho & \kappa & \kappa \rho & \cdots & \kappa \rho^{n-2} \\ \vdots & \vdots & \vdots & & \vdots \\ \kappa \rho^{n-1} & \kappa \rho^{n-2} & \kappa \rho^{n-3} & \cdots & \kappa \end{bmatrix}\]

where \(\kappa = \frac{\sigma^2}{1-\rho^2}\).
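As a quick sanity check (the helper name `ar1_error_cov` is my own), the matrix above can be built in a few lines of numpy from \(\kappa\) and powers of \(\rho\):

```python
import numpy as np

def ar1_error_cov(sigma2, rho, n):
    """Variance-covariance matrix of AR(1) errors:
    entry (s, t) equals kappa * rho**|s-t| with kappa = sigma2 / (1 - rho**2)."""
    kappa = sigma2 / (1 - rho ** 2)
    idx = np.arange(n)
    return kappa * rho ** np.abs(idx[:, None] - idx[None, :])

V = ar1_error_cov(sigma2=1.0, rho=0.5, n=4)
# Diagonal entries are kappa = 1 / (1 - 0.25) = 4/3; the (0, 1) entry is kappa * rho = 2/3.
```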
Now we want to determine whether the regression model has autocorrelation. Note that \(\rho=0\) implies \(\epsilon_t = u_t\), i.e., the errors are uncorrelated.
\(H_0 : \rho=0\) \(H_1 : \rho>0\) (we consider the positive autocorrelation.)
Test statistic
\[D = \frac{\sum_{t=2}^n (e_t - e_{t-1})^2}{\sum_{t=1}^n e_t^2}\]where \(e_t = Y_t - \hat{Y_t}\).
Exact critical values are difficult to obtain, but there are an upper bound \(d_u\) and a lower bound \(d_l\) such that a value of \(D\) outside these bounds leads to a definite decision. Decision rule:
Idea: since \((e_t - e_{t-1})^2\) is small when the errors are positively autocorrelated, a small \(D\) suggests \(\rho >0\).
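The statistic \(D\) is straightforward to compute from the residuals; a minimal sketch (the function name is my own):

```python
import numpy as np

def durbin_watson(e):
    """D = sum_{t=2}^n (e_t - e_{t-1})^2 / sum_{t=1}^n e_t^2."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Strongly positively autocorrelated residuals give a small D,
# while alternating-sign residuals push D toward its maximum of 4.
d_pos = durbin_watson([1.0, 1.1, 1.2, 1.3, 1.4])
d_alt = durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0])
```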
Time-ordered effects are often due to missing key variables. When long-term persistent effects cannot be captured by one or several predictor variables, a trend component can be added to the regression model, such as a linear/exponential trend, or indicator variables for seasonal effects.
When the addition of predictor variables fails to cure the autocorrelation, we may consider transforming the variables to obtain an ordinary regression model. This can be done via \(Y_t^\prime = Y_t - \rho Y_{t-1}\)
\[\Rightarrow Y_t^\prime = (\beta_0 + \beta_1 X_t + \epsilon_t) - \rho (\beta_0 + \beta_1 X_{t-1} + \epsilon_{t-1}) = \beta_0 (1 - \rho) + \beta_1 (X_t - \rho X_{t-1}) + u_t\]Set \(X_t^\prime = X_t - \rho X_{t-1}\), \(\beta_0^\prime = \beta_0 (1 - \rho)\), and \(\beta_1^\prime = \beta_1\). \(\Rightarrow\) \(Y_t^\prime = \beta_0^\prime + \beta_1^\prime X_t^\prime + u_t\) where the \(u_t\)'s are uncorrelated and follow \(N(0,\sigma^2)\).
To use the transformed model, we need to estimate \(\rho\). Say we have \(r\), an estimator of \(\rho\). Then, after fitting \(\hat{Y^\prime} = b_0^\prime + b_1^\prime X^\prime\) on the transformed data, we can restore the fitted function in the original variables, \(\hat{Y} = b_0 + b_1 X\), via \(b_0 = \frac{b_0^\prime}{1-r}\) and \(b_1 = b_1^\prime\), where \(s(b_0) = \frac{s(b_0^\prime)}{1-r}\) and \(s(b_1) = s(b_1^\prime)\).
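Assuming an estimate \(r\) is available, the transform-fit-restore cycle can be sketched as follows (a noise-free check; the function name is my own):

```python
import numpy as np

def fit_transformed(X, Y, r):
    """Transform Y'_t = Y_t - r*Y_{t-1}, X'_t = X_t - r*X_{t-1},
    fit OLS on the transformed data, then restore the original-scale
    coefficients b0 = b0'/(1-r), b1 = b1'."""
    Yp = Y[1:] - r * Y[:-1]
    Xp = X[1:] - r * X[:-1]
    A = np.column_stack([np.ones_like(Xp), Xp])
    b0p, b1p = np.linalg.lstsq(A, Yp, rcond=None)[0]
    return b0p / (1 - r), b1p  # (b0, b1) on the original scale

# Noise-free check: data generated exactly by Y = 2 + 3*X recovers (2, 3).
X = np.arange(10, dtype=float)
Y = 2.0 + 3.0 * X
b0, b1 = fit_transformed(X, Y, r=0.4)
```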
There are three different procedures to obtain transformed models.
It consists of three steps and their iterations. Each step is:
Note This approach does not always work properly. When the error terms are positively autocorrelated, the estimate \(r\) tends to underestimate \(\rho\).
This is analogous to the Box-Cox procedure. The value of \(\rho\) is chosen to minimize the error sum of squares of the transformed regression model, \(SSE = \sum_t (Y_t^\prime - \hat{Y_t^\prime})^2 = \sum_t (Y_t^\prime - b_0^\prime - b_1^\prime X_t^\prime)^2\). Then test for autocorrelation (by the Durbin-Watson test).
Note This approach does not need iterations, but an optimization process is required.
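A minimal sketch of the grid search over \(\rho\) (the function name, grid resolution, and simulated data are my own choices):

```python
import numpy as np

def hildreth_lu(X, Y, grid=None):
    """Grid-search rho to minimize the SSE of the transformed regression
    Y'_t = b0' + b1' X'_t."""
    if grid is None:
        grid = np.linspace(-0.95, 0.95, 191)   # step 0.01
    best_sse, best_rho = np.inf, None
    for rho in grid:
        Yp = Y[1:] - rho * Y[:-1]
        Xp = X[1:] - rho * X[:-1]
        A = np.column_stack([np.ones_like(Xp), Xp])
        b, *_ = np.linalg.lstsq(A, Yp, rcond=None)
        sse = np.sum((Yp - A @ b) ** 2)
        if sse < best_sse:
            best_sse, best_rho = sse, rho
    return best_rho

# Simulate AR(1) errors with rho = 0.7 and check the grid picks a nearby value.
rng = np.random.default_rng(0)
n = 500
u = rng.normal(0, 1, n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.7 * eps[t - 1] + u[t]
X = np.arange(n, dtype=float)
Y = 1.0 + 2.0 * X + eps
rho_hat = hildreth_lu(X, Y)
```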
Facts
Therefore, take \(\rho = 1\). Then, \(\beta_0^\prime = \beta_0 (1-\rho) = 0\) and the transformed model is \(Y_t^\prime = \beta_1^\prime X_t^\prime + u_t\) where \(Y_t^\prime = Y_t - \rho Y_{t-1}\), and \(X_t^\prime = X_t - \rho X_{t-1}\). In the transformed model, the intercept term is canceled out.
The fitted regression line in the transformed variables is \(\hat{Y^\prime} = b_1^\prime X^\prime\) which can be transformed back to the original variables as \(\hat{Y} = b_0 + b_1 X\) where \(b_0 = \bar{Y} - b_1^\prime \bar{X}\), and \(b_1 = b_1^\prime\).
Note This approach is effective in a variety of applications and much simpler than the other two procedures. (The estimation of \(\rho\) is eliminated.)
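The first differences procedure can be sketched as (function name is my own):

```python
import numpy as np

def first_differences_fit(X, Y):
    """First differences procedure (rho = 1): regress Y_t - Y_{t-1} on
    X_t - X_{t-1} with no intercept, then restore b0 = Ybar - b1' * Xbar."""
    dY = np.diff(Y)
    dX = np.diff(X)
    b1p = np.sum(dX * dY) / np.sum(dX ** 2)  # no-intercept slope
    b0 = Y.mean() - b1p * X.mean()
    return b0, b1p

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = 2.0 + 3.0 * X          # exact line: recovers (2, 3)
b0, b1 = first_differences_fit(X, Y)
```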
\(Y_t = \beta_0 + \beta_1 X_t + \epsilon_t\) where \(\epsilon_t = \rho \epsilon_{t-1} + u_t\). i.e. \(Y_t =\)\(\beta_0 + \beta_1 X_t\) \(+\) \(\rho \epsilon_{t-1}\) \(+\) \(u_t\).
Thus, the forecast for the next period \(t+1\), \(F_{t+1}\) is constructed by handling the three components:
\(\Rightarrow\) The forecast for the period \(t+1\) is \(F_{t+1} = \hat{Y_{t+1}} + re_t\).
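A one-line sketch of the forecast formula (names are my own):

```python
def forecast_next(b0, b1, X_next, e_t, r):
    """F_{t+1} = Yhat_{t+1} + r * e_t, where Yhat_{t+1} = b0 + b1 * X_{t+1}
    and e_t is the last observed residual."""
    return b0 + b1 * X_next + r * e_t

F = forecast_next(b0=1.0, b1=2.0, X_next=5.0, e_t=0.8, r=0.5)
# 1 + 2*5 + 0.5*0.8 = 11.4
```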
An approximate \(1-\alpha\) prediction interval for \(Y_{t+1, new}\) is obtained by employing the usual prediction limits for a new observation but based on the transformed observation. The process is \((X_i, Y_i) \rightarrow (X_i^\prime, Y_i^\prime) \rightarrow s^2(pred)\)
Therefore, an approximate \(1-\alpha\) prediction interval for \(Y_{t+1, new}\) is \(F_{t+1} \pm t(1-\alpha/2 ; n-3) s(pred)\). Here, the degrees of freedom are \(n-3\) because only \(n-1\) transformed cases are used, and 2 degrees are lost for estimating the parameters \(\beta_0\) and \(\beta_1\).
Transformation is one of the standard remedial measures for a linear model. Recall that its uses are:
In this chapter, we present additional remedial measures to handle several pitfalls. Then we discuss nonparametric regression methods (which are quite different from the previous regression models). A common feature of these remedial measures and alternative regression methods is that their estimation procedures are relatively complex, so we need an easier and more generic way to evaluate the precision of these complex estimators. Bootstrapping is one example.
Generalized multiple regression model \(Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} + \epsilon_i\) where the \(\epsilon_i\)'s are independent, \(\epsilon_i \sim N(0, \sigma_i^2)\), \(i = 1,2,\cdots, n\). Here, the error variances may not be equal, so \(\sigma^2(\epsilon)\) is an \(n\times n\) diagonal matrix whose \(i\)th diagonal entry is \(\sigma_i^2\).
When we use ordinary least square estimators, then still we have these properties:
A remedy to handle this problem is to use weighted least squares. First, we start from the simplest case.
when error variances are known Suppose that the error variances \(\sigma_i^2\) for all \(n\) observations are given. Then we use the maximum likelihood method again. First, we define the likelihood function \(L(\beta)\).
\[L(\beta) = \prod_{i=1}^n \frac{1}{(2\pi \sigma_i^2)^\frac{1}{2}} exp [ - \frac{1}{2 \sigma_i^2} (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1})^2 ]\]Here, we substitute \(w_i = \frac{1}{\sigma_i^2}\) and rewrite the formula.
\[L(\beta) = \prod_{i=1}^n \left( \frac{w_i}{2\pi}\right)^\frac{1}{2} exp [ - \frac{w_i}{2} (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1})^2 ]\]Since the \(\sigma_i^2\)'s are known, the only parameters of the likelihood function are the regression coefficients. The maximum likelihood method finds the critical point that maximizes the likelihood function.
This is the same as minimizing \(Q_w = \sum_{i=1}^n w_i (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1})^2\). The normal equations can be written using the weight matrix \(W = diag(w_1, w_2, \cdots, w_n)\) as below:
\[X^t WX b_w = X^tW Y\]\(\Rightarrow b_w = (X^t WX)^{-1} X^tW Y\) where \(\sigma^2(b_w) = (X^t WX)^{-1}\). This is called the weighted least squares estimator.
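A minimal numpy sketch of the weighted least squares estimator (the function name is my own):

```python
import numpy as np

def wls(X, Y, w):
    """Weighted least squares: solve X^t W X b_w = X^t W Y
    with W = diag(w).  X should already include the intercept column."""
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

# With all weights equal, WLS reduces to ordinary least squares.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)
b_equal = wls(X, Y, np.ones(50))
b_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
```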
when variances are known up to a proportionality constant Now, relaxing the condition, assume that we only know the proportions among the error variances. Consider a case where the second error variance is twice as large as the first, i.e. \(\sigma_2^2 = 2\sigma_1^2\). Then we can write the \(w_i\)'s in the form \(w_i = \frac{1}{k_i\sigma^2}\), where \(\sigma^2\) is the unknown true error variance of \(\epsilon_1\). Then the \(p \times p\) matrix \(\sigma^2 (b_w) = \sigma^2 (X^t K^{-1} X)^{-1}\), where \(K = diag(k_1, k_2, \cdots, k_n)\). Since \(\sigma^2\) is unknown, we use \(s^2(b_w) = MSE_w (X^t K^{-1} X)^{-1}\) instead, where \(MSE_w = \frac{\sum_{i=1}^n w_i (Y_i - \hat{Y_i})^2}{n-p} = \frac{\sum_{i=1}^n w_i e_i^2}{n-p}\).
when error variances are unknown We use estimators of the error variances. There are two approaches.
Inference procedures must account for the fact that the weights are estimated. We want to estimate the variance-covariance matrix \(\sigma^2(b_w)\). The confidence intervals for the regression coefficients are estimated with \(s^2(b_{w_k})\), where \(k\) stands for a bootstrap index; we then apply the bootstrapping method.
Lastly, we can use ordinary least squares while leaving the error variances unequal. The regression coefficient \(b\) is still unbiased and consistent, but no longer a minimum variance estimator; its variance is given as \(\sigma^2(b) = (X^tX)^{-1} X^t \sigma^2(\epsilon)X (X^tX)^{-1}\) and is estimated by \(s^2(b) = (X^tX)^{-1} X^t S_0 X (X^tX)^{-1}\) where \(S_0 = diag(e_1^2, e_2^2, \cdots, e_n^2)\).
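A sketch of this robust variance estimate (often called White's, or HC0, estimator; the function name and toy data are my own):

```python
import numpy as np

def hc0_cov(X, Y):
    """s^2(b) = (X^tX)^{-1} X^t S0 X (X^tX)^{-1} with S0 = diag(e_i^2),
    for OLS under unequal error variances."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y
    e = Y - X @ b
    S0 = np.diag(e ** 2)
    return XtX_inv @ X.T @ S0 @ X @ XtX_inv

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
# Heteroscedastic noise: variance grows with |x|.
Y = X @ np.array([0.5, 1.5]) + rng.normal(size=100) * (1 + np.abs(X[:, 1]))
V = hc0_cov(X, Y)  # 2x2, symmetric, positive diagonal
```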
Remark on weighted least squares Weighted least squares can be interpreted as transforming the data \((X_i, Y_i, \epsilon_i)\) by \(W^{\frac{1}{2}}\). This can be written as
\[W^{\frac{1}{2}}Y = W^{\frac{1}{2}}X\beta + W^{\frac{1}{2}}\epsilon\]and by setting \(Y_w = W^{\frac{1}{2}} Y\), \(X_w = W^{\frac{1}{2}} X\), and \(\epsilon_w = W^{\frac{1}{2}} \epsilon\), the transformed variables follow the ordinary linear regression model
\[Y_w = X_w \beta + \epsilon_w\]where \(E(\epsilon_w) = W^{\frac{1}{2}} E(\epsilon) = W^{\frac{1}{2}} 0 = 0\) and \(\sigma^2(\epsilon_w) = W^{\frac{1}{2}} \sigma^2(\epsilon) W^{\frac{1}{2}} = W^{\frac{1}{2}} W^{-1} W^{\frac{1}{2}} = I\). The regression coefficients are given as \(b_w = (X_w^t X_w)^{-1} X_w^t Y_w\).
When multicollinearity is detected in a regression model, we have fixed the model by
Ridge regression perturbs the estimated coefficient \(b_R\) away from the unbiased estimator \(b\), and can thereby remedy the multicollinearity problem. In other words, the unbiased estimator \(b\) is obtained from observations \(X\) with multicollinearity, while the slightly perturbed estimator \(b_R\) can be viewed as obtained from virtual perturbed observations \(X_R\) without multicollinearity. When perturbing, we restrict \(b_R\) to have a small norm (length). The length restriction shrinks the variance of the sampling distribution of \(b_R\). Although \(b_R\) is no longer an unbiased estimator, a sampled \(b_R\) is more likely to be close to \(E(b_R)\) than a sampled \(b\) is to \(E(b)=\beta\). The relationship is given in a figure.
To obtain \(b_R\), we use the correlation-transformed \(X\) observations. This normalizes the variables, adjusting the magnitude of \(SSE = (Y-X\beta)^t (Y-X\beta)\). \(b_R\) is obtained by minimizing the modified sum of squared errors \(Q(\beta) = (Y-X\beta)^t (Y-X\beta) + c \beta^t \beta\), where \(c \geq 0\) is a given biasing constant. The squared \(\ell_2\) norm of \(\beta\) is added to the function, which lets us minimize both \(SSE\) and the length of \(\beta\) at the same time. How strongly the minimization focuses on the length of \(\beta\) is adjusted by the choice of \(c\): the larger \(c\) is, the more the minimization focuses on shrinking \(\beta\). The resulting normal equations are \((X^t X + cI) b_R = X^t Y\). Therefore, \(b_R = (X^tX + cI)^{-1} X^tY\).
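A minimal sketch of the ridge estimator (the function name is my own; the predictors are assumed already centered/standardized, so no intercept column is included):

```python
import numpy as np

def ridge(X, Y, c):
    """Ridge estimator b_R = (X^tX + cI)^{-1} X^tY; c = 0 gives OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + c * np.eye(p), X.T @ Y)

rng = np.random.default_rng(3)
x1 = rng.normal(size=80)
x2 = x1 + 0.01 * rng.normal(size=80)       # nearly collinear predictors
X = np.column_stack([x1, x2])
Y = x1 + x2 + 0.1 * rng.normal(size=80)
b_c0 = ridge(X, Y, c=0.0)
b_c1 = ridge(X, Y, c=1.0)
# Increasing c shrinks the coefficient vector toward zero.
```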
Choice of biasing constant \(c\) As the biasing constant \(c\) increases, the bias of the model increases, but there exists a value of \(c\) for which the regression estimator \(b_R\) has a smaller total mean squared error \(E\|b_R - \beta\|^2\) than the ordinary LSE \(b\). However, the optimal value of \(c\) varies from one application to another and is unknown. Common heuristics to determine an appropriate \(c\) are:
When an outlying influential case is not clearly erroneous, we should proceed with a further examination to obtain important clues for improving the model. We may find
Robust regression is an approach of regression to make the results less rely on outlying cases. There are several ways to conduct robust regression:
For complex cases, it is hard to guess a right functional form (analytic expression) of the regression function. Also, as the number of parameter increases, the space of observations gets rarefied (there are fewer cases in a neighborhood), so that the regression function tends to behave erratically. For these cases, nonparametric regression could be helpful.
Regression tree is a powerful, yet computationally simple method of nonparametric regression. It requires virtually no assumptions, and it is simple to interpret. It is popular for explanatory studies, especially for large data sets.
In each branch of a regression tree, the range of a specific predictor variable \(X_k\) is partitioned into two segments (one for \(X_k \leq C\) and the other for \(X_k > C\)). In the terminal nodes of the tree, the estimated regression fit is the mean response in that segment, i.e. \(\hat{Y_i} = \frac{1}{n_s}\sum_{j=1}^{n_s} Y_j\), where the \(Y_j\)'s belong to the segment containing the \(i\)th observation and \(n_s\) is the number of observations in that segment.
How to find a “best” regression tree is linked with
The basic idea is to choose the model that minimizes the error sum of squares for the resulting regression tree: \(SSE = \sum_{t_i \in T} SSE(t_i)\), where \(T\) is the set of all terminal nodes and \(SSE(t_i) = \sum_{j \in t_i} (Y_j - \overline{Y_{t_i}})^2\). Here, \(MSE = SSE/(n-r)\) and \(R^2 = 1- \frac{SSE}{SSTO}\). The number of candidate models is \(r(p-1)\), where \(1 \leq r \leq n\). The value of \(r\) is usually chosen through validation studies as \(argmin_r{MSPR}\).
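A sketch of the SSE-minimizing search for a single split point on one predictor (the function name and toy data are my own):

```python
import numpy as np

def best_split(x, y):
    """Search over candidate cut points C for one predictor, choosing the
    split (x <= C vs. x > C) that minimizes SSE = sum of within-segment
    squared deviations from the segment means."""
    best_c, best_sse = None, np.inf
    for c in np.unique(x)[:-1]:
        left, right = y[x <= c], y[x > c]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_c, best_sse = c, sse
    return best_c, best_sse

# A step function is split exactly at the jump.
x = np.arange(10, dtype=float)
y = np.where(x < 5, 1.0, 9.0)
c, sse = best_split(x, y)
```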
The regression approaches above are complex, so it is hard to estimate confidence intervals using the previous analysis of \(\sigma^2(b)\) or \(s^2(b)\). Instead, we can approximate confidence intervals by the bootstrap method. There are many different procedures to obtain bootstrap confidence intervals; one example is the reflection method.
A \(1-\alpha\) confidence interval for a parameter \(\beta\) is \(b - (b^\ast(1-\alpha/2) - b) \leq \beta \leq b + (b - b^\ast(\alpha/2))\), where \(b^\ast(1-\alpha/2)\) and \(b^\ast(\alpha/2)\) are percentiles obtained from the bootstrap distribution.
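A sketch of the reflection method for the slope of a simple linear model, resampling cases (the names, the number of resamples \(B\), and the toy data are my own choices):

```python
import numpy as np

def reflection_ci(X, Y, alpha=0.05, B=2000, seed=0):
    """Reflection-method bootstrap CI for the slope b1:
    b - (b*(1-alpha/2) - b) <= beta <= b + (b - b*(alpha/2))."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    A = np.column_stack([np.ones(n), X])
    b = np.linalg.lstsq(A, Y, rcond=None)[0][1]
    boots = np.empty(B)
    for k in range(B):
        idx = rng.integers(0, n, n)                 # resample cases with replacement
        boots[k] = np.linalg.lstsq(A[idx], Y[idx], rcond=None)[0][1]
    lo_q, hi_q = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return b - (hi_q - b), b + (b - lo_q)

rng = np.random.default_rng(4)
X = rng.normal(size=60)
Y = 1.0 + 2.0 * X + rng.normal(size=60) * 0.5   # true slope 2
lo, hi = reflection_ci(X, Y)
```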
Once we've established a model, the next step is to try to improve it. Or, if the model runs into trouble, we should diagnose where the trouble comes from. In a linear regression model, the possible reasons include
Here are some suggestions for detecting such factors.
You draw added-variable plots to detect whether a predictor variable enters the model in an improper functional form. For a regression model with two predictor variables \(X_1\) and \(X_2\), we want to find the regression effect of \(X_1\) given that \(X_2\) is already in the model. Here are the steps.
Regress \(Y\) on \(X_2\) and obtain the fitted values \(\hat{Y_i}(X_2) = b_0 + b_1 X_{i2}\) and residuals \(e_i(Y \mid X_2) = Y_i - \hat{Y_i}(X_2)\).
Regress \(X_1\) (the variable being tested) on \(X_2\) and obtain the fitted values \(\hat{X_{i1}}(X_2) = b_0^\ast + b_1^\ast X_{i2}\) and residuals \(e_i(X_1 \mid X_2) = X_{i1} - \hat{X_{i1}}(X_2)\).
a. No additional information exists from adding \(X_1\).
b. There is additional linear information from adding \(X_1\).
c. There is additional nonlinear information from adding \(X_1\).
When it comes to a model with one-predictor variable, we can simply look at scatter plots in order to identify any outliers and high leverage observations.
What if we have a model with multiple predictor variables? We can identify such observations by diagonal entries of the hat matrix \(H = X(X^tX)^{-1}X^t\).
\(h_{ii}\) has useful properties such as
By property 2, the mean of the \(h_{ii}\)'s is \(\frac{p}{n}\). Also, by property 3, we can see that if \(h_{ii}\) is large, then the observed response \(Y_i\) plays a large role in the value of the predicted response \(\hat{Y_i}\). \(\Rightarrow\) if \(h_{ii}\) is more than twice the average (\(h_{ii}>\frac{2p}{n}\)), then the case is a high leverage observation.
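A small sketch of the \(h_{ii} > 2p/n\) rule (names and toy data are my own):

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X^tX)^{-1} X^t."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)

# One x-value far from the rest gets flagged: h_ii > 2p/n.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
X = np.column_stack([np.ones_like(x), x])
h = leverages(X)
n, p = X.shape
flagged = np.where(h > 2 * p / n)[0]   # only the last case is flagged
```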
Recall We have used the studentized residual \(r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE(1-h_{ii})}}\), approximately distributed as \(t(n-p)\), to identify unusual \(Y\) values. But this approach carries a risk: a potential outlier can pull the estimated regression function toward itself, so that it is not flagged by the studentized-residual criterion.
To address this issue, we use an alternative criterion based on deleted residuals. We delete an observation \(Y_i\) and fit a regression model \(\hat{Y_{(i)}}\) on the remaining \(n-1\) observations. Then we compare the observed response value \(Y_i\) to its fitted value from \(\hat{Y_{(i)}}\). This produces the deleted residuals \(d_i = Y_i - \hat{Y_{i(i)}}\). Standardizing the deleted residuals produces the studentized deleted residuals.
We can prove that \(d_i = Y_i - \hat{Y_{i(i)}} = \frac{e_i}{1-h_{ii}}\) proof where \(s^2(d_i) = MSE_{(i)}\left(1 + X_i^t(X_{(i)}^tX_{(i)})^{-1}X_i \right) = \frac{MSE_{(i)}}{1-h_{ii}}\). Moreover, we can obtain \(MSE_{(i)}\) from \(MSE\) by the formula \((n-p)MSE = (n-p-1)MSE_{(i)} + \frac{e_i^2}{1-h_{ii}}\). Therefore, \(MSE_{(i)} = \frac{1}{n-p-1} \left((n-p)MSE - \frac{e_i^2}{1-h_{ii}} \right) = MSE + \frac{1}{n-p-1} \left(MSE - \frac{e_i^2}{1-h_{ii}}\right)\).
Thanks to these formulas, we don't need to fit \(n\) distinct regression lines. We then test with this alternative standardized residual \(t_i = \frac{d_i}{s(d_i)} \sim t(n-p-1)\). If \(\mid t_i \mid\) is large, we conclude that the corresponding observation \(Y_i\) is an outlier.
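Using the formulas above, the studentized deleted residuals can be computed from a single fit (the function name and toy data are my own):

```python
import numpy as np

def studentized_deleted(X, Y):
    """t_i = d_i / s(d_i), computed without refitting via
    MSE_(i) = ((n-p)MSE - e_i^2/(1-h_ii)) / (n-p-1), which simplifies to
    t_i = e_i * sqrt((n-p-1) / (SSE*(1-h_ii) - e_i^2))."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = Y - H @ Y
    sse = np.sum(e ** 2)
    return e * np.sqrt((n - p - 1) / (sse * (1 - h) - e ** 2))

rng = np.random.default_rng(5)
x = np.arange(10, dtype=float)
X = np.column_stack([np.ones(10), x])
Y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=10)
Y[4] += 5.0            # plant one outlier
t = studentized_deleted(X, Y)
```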
DFFITS : Influence of the observation \(Y_i\) on the \(i\)th fitted value \(\hat{Y_i}\)
\[DFFITS_i = \frac{\hat{Y_i} - \hat{Y_{i(i)}}}{\sqrt{MSE_{(i)} h_{ii}}} = t_i \left( \frac{h_{ii}}{1-h_{ii}} \right)^\frac{1}{2}\]If it is larger than 1 (for small to medium-sized datasets) or larger than \(2\sqrt{\frac{p}{n}}\) (for large datasets), the \(i\)th case is said to be influential.
Cook’s distance : Influence of the observation \(Y_i\) on the fitted function \(\hat{Y}\)
\[D_i = \frac{\sum_{j=1}^n (\hat{Y_j} - \hat{Y_{j(i)}})^2}{p \cdot MSE} = \frac{e_i^2}{p \cdot MSE} \frac{h_{ii}}{(1-h_{ii})^2} = \frac{1}{p}r_i^2\frac{h_{ii}}{1-h_{ii}}\]If \(D_i\) falls at the 50th percentile or higher of the \(F(p, n-p)\) distribution, the \(i\)th case is said to be influential.
DFBETAS : Influence of the observation \(Y_i\) on the regression coefficients
\[DFBETAS_{k(i)} = \frac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)} C_{kk}}}, \quad k=0,1,\cdots, p-1\]where \(C_{kk}\) is the \(k\)th diagonal element of \((X^tX)^{-1}\). Note that \(C_{kk} = \frac{\sigma^2(b_k)}{\sigma^2}\). If its absolute value is large (a common guideline: greater than 1 for small to medium-sized datasets, or greater than \(\frac{2}{\sqrt{n}}\) for large datasets), the \(i\)th case is said to be influential.
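All three influence measures can be computed from one fit, using the deletion formulas above together with the standard identity \(b - b_{(i)} = (X^tX)^{-1} X_i e_i/(1-h_{ii})\) (the function name and toy data are my own):

```python
import numpy as np

def influence_measures(X, Y):
    """DFFITS_i, Cook's D_i, and DFBETAS_{k(i)} from closed forms (no refitting)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T
    h = np.diag(H)
    e = Y - H @ Y
    sse = np.sum(e ** 2)
    mse = sse / (n - p)
    # studentized deleted residuals
    t = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e ** 2))
    dffits = t * np.sqrt(h / (1 - h))
    cooks = e ** 2 * h / (p * mse * (1 - h) ** 2)
    mse_i = (sse - e ** 2 / (1 - h)) / (n - p - 1)          # MSE_(i)
    delta_b = (XtX_inv @ X.T) * (e / (1 - h))               # p x n: b - b_(i) per case
    C = np.diag(XtX_inv)
    dfbetas = delta_b / np.sqrt(np.outer(C, mse_i))
    return dffits, cooks, dfbetas

rng = np.random.default_rng(6)
x = np.arange(12, dtype=float)
X = np.column_stack([np.ones(12), x])
Y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=12)
Y[3] += 4.0            # plant one influential case
dffits, cooks, dfbetas = influence_measures(X, Y)
```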
From a standardized regression model, we obtained correlation matrix \(r_{XX} = x^tx\). Now, we have variance-covariance matrix of standardized regression coefficients \(\sigma^2(b^\ast) = (\sigma^\ast)^2 r_{XX}^{-1}\).
This is the standardized version of \(\sigma^2(b) = \sigma^2 (X^tX)^{-1}\).
Let us define the \(k\)th variance inflation factor \(VIF_k\) as the \(k\)th diagonal element of \(r_{XX}^{-1}\). We can find that \(\sigma^2(b_k^\ast) = (\sigma^\ast)^2 VIF_k\).
\(VIF_k \geq 1\), and we can diagnose multicollinearity among predictor variables by its magnitude.
If \(VIF_k > 10\), or if \(\overline{VIF}= \frac{\sum_{k=1}^{p-1} VIF_k}{p-1}\) is considerably larger than 1, we can say that the \(k\)th predictor variable or the entire model suffers from multicollinearity.
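A sketch of the VIF computation from the inverse correlation matrix (names and toy data are my own):

```python
import numpy as np

def vif(X):
    """VIF_k = k-th diagonal element of r_XX^{-1}, the inverse correlation
    matrix of the predictor variables (intercept column excluded)."""
    r = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(r))

rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                 # independent of x1
x3 = x1 + 0.05 * rng.normal(size=200)     # nearly a copy of x1
vifs = vif(np.column_stack([x1, x2, x3]))
# x1 and x3 get large VIFs; x2 stays near 1.
```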