Jekyll2022-12-27T07:42:08-05:00https://hyemingu.github.io/feed.xmlblankA simple, whitespace theme for academics. Based on [*folio](https://github.com/bogoli/-folio) design. Sample generation from unknown distributions - Particle Descent Algorithm induced by (f, Γ)-gradient flow2022-05-31T13:39:00-04:002022-05-31T13:39:00-04:00https://hyemingu.github.io/blog/2022/sample_generation_through_gpa_by_gradient_flowAbstract This project introduces ongoing research on generating samples from a data set whose distribution is unknown. It focuses on a mass transportation approach to the problem. First, preliminaries on the mass transportation problem and gradient flows on probability measures are briefly introduced. Then, a particle descent algorithm equipped with a flexible measure of distance is introduced. Experiments on low-dimensional examples illustrate how this measure of distance depends on the target probability distribution. The strengths of this work come from the flexible choice of the measure of distance and from a behavior that interpolates between f-divergences and Γ-integral probability metrics. The efficiency of the algorithm is also examined by comparing its convergence with that of a different algorithm, the generative adversarial network. Finally, a different approach based on Markov chain Monte Carlo is briefly discussed for sample generation from high-dimensional data such as image data.

]]>
Python ODE solving tutorial2021-12-05T12:39:00-05:002021-12-05T12:39:00-05:00https://hyemingu.github.io/blog/2021/python_ode_solving1. Python Tutorial and JupyterNotebook Setup

If you need help setting up your computer for numerical computation and interactive results, this 15-minute video covers the first steps.

## 3. PINN for Dynamical systems

This 18-minute video is a tutorial on “Implementation and Python Library Tutorial for PINNs to Handle Dynamical Systems”.

The Lorenz model and its inverse/forward problems are chosen as the example dynamical system.

I attached the modified example code for the forward/inverse Lorenz model as a Jupyter notebook (ipynb) and the slides.

]]>
Lorenz Equations for Atmospheric Convection Modeling2021-05-31T13:39:00-04:002021-05-31T13:39:00-04:00https://hyemingu.github.io/blog/2021/lorenz_eqAbstract The Lorenz model is a dynamical system of three first order differential equations. It was designed by Lorenz as a simplified model of atmospheric convection. This project assumes that the Earth’s atmosphere is an incompressible fluid situated between two horizontal planes. Using governing equations in 2D hydrodynamics, the steps of Lorenz are followed to derive the Lorenz equations from an abstract climate model. Then, properties of the Lorenz equations are explored and illustrated by individual examples and their interpretations.

]]>
Learning operators2021-05-31T13:39:00-04:002021-05-31T13:39:00-04:00https://hyemingu.github.io/blog/2021/learning_operatorAbstract Since neural network frameworks such as the Physics-informed Neural Network were introduced to solve PDEs and dynamical systems, solutions of differential equations can be learned from data, and this could replace or improve conventional numerical solvers. Still, this “learning functions” approach has limitations. A new framework of “learning operators” emerged to address these limitations, and in particular, DeepONet provides a neural network architecture for learning operators which map functions to functions. This talk shows the mechanism and achievements of DeepONet (2020) and mentions its follow-up studies.

View slides

]]>
Data-dependent Kernel Support Vector Machine classifiers in Reproducing Kernel Hilbert Space2021-05-30T13:39:00-04:002021-05-30T13:39:00-04:00https://hyemingu.github.io/blog/2021/kernelSVM_in_RKHSAbstract The Support Vector Machine (SVM) provides a linear classifier for binary classification problems. Complex decision boundaries in the input feature space are handled by applying nonlinear kernels to the SVM. The theory of Reproducing Kernel Hilbert Spaces (RKHS) states that, given a kernel $$\mathcal{K}$$ and a set of $$M$$ data points $$(x_i,y_i)$$, for $$i=1,\cdots,M$$, an SVM classifier function can be written as $$f(x) =\alpha_0 + \sum_{i=1}^M \alpha_i \mathcal{K}(x, x_i)$$ for some coefficients $$\alpha_i$$. Also, applying a conformal transform to a positive definite kernel produces another, more complex positive definite kernel. Hence, when well-known kernels fail on the current training data, a new kernel can be tried by optimizing the coefficients of a conformal kernel so as to maximize the ratio “(Between-class error)/(Within-class error)” of the training data. Here, the data-dependent kernel SVM is applied to classifying tumor/tumor-free organs from gene expression data, and its classification performance is compared with that of other well-known kernels.

View file

]]>
R Bioinformatics 2. R Genome informatics methodology and practice.2020-12-30T12:39:00-05:002020-12-30T12:39:00-05:00https://hyemingu.github.io/blog/2020/r_bioinformaticsThis book is a tutorial for R Bioinformatics practices in 2020 organized by Hyemin Gu and Yijun Kim in Ewha Womans University Mokdong hospital. The tutorial focuses on gene expression data analysis.

View file

]]>
R Bioinformatics 1. R Statistics.2020-09-30T13:39:00-04:002020-09-30T13:39:00-04:00https://hyemingu.github.io/blog/2020/r_statisticsThis book contains the lecture notes for a six-day R Bioinformatics class taught in 2020 by Hyemin Gu. The lectures proceeded by running code line by line and explaining it.

View file

]]>
RA ch12 Autocorrelation in time series data2020-07-11T11:12:00-04:002020-07-11T11:12:00-04:00https://hyemingu.github.io/blog/2020/regression_ch12Ch 12: Autocorrelation in time series data

In the previous chapters, errors $$\epsilon_i$$’s are assumed to be

• uncorrelated random variables or
• independent normal random variables.

However, in business and economics, time series data often fail to satisfy the above assumptions. In time series data, error terms are likely to be autocorrelated (serially correlated) over time. This is mainly because

• one or more important predictor variables are missing (the error terms include the effects of the missing variables, so when the time-ordered effects of those variables are positively correlated, so are the error terms.)
• there are systematic coverage errors in the response variable.

Positively autocorrelated error terms may cause problems such as:

• $$E(b) = \beta$$ but $$Var(b)$$ is not the minimum
• $$MSE$$ and $$s(b_k)$$ may seriously underestimate $$\sigma^2$$ and $$\sigma(b_k)$$, respectively $$\Rightarrow$$ the usual confidence intervals and tests are no longer applicable.
• the regression line sharply differs from the true line
• the regression line is sensitive to the initial error $$\epsilon_0$$
Problem of autocorrelation

Therefore, we need to analyze a regression model with autoregressive errors, test whether the errors of a model are autocorrelated, and then remedy the autocorrelation.

## Outline

1. First order autoregressive error model
• One predictor variable case
• properties of error terms
2. Durbin-Watson test for autocorrelation
3. Remedial measures for autocorrelation
• use of transformed variables
• way 1: Cochrane-Orcutt procedure
• way 2: Hildreth-Lu procedure
• way 3: first difference procedure
4. forecasting with autocorrelated error terms

## 1. First order autoregressive error model

### one predictor variable case

The generalized simple linear regression model for one predictor variable when the random error terms follow a first order autoregressive, or AR(1), process is

1: $$Y_t = \beta_0 + \beta_1 X_t + \epsilon_t$$

2: $$\epsilon_t = \rho \epsilon_{t-1} + u_t$$

where $$\rho$$ is a parameter such that $$|\rho| <1$$ and $$u_t$$ are independent $$N(0, \sigma^2)$$. $$\rho$$ is called the autocorrelation parameter.

The one-variable case above can be generalized to the case of multiple variables. The only change is that line 1 becomes 1’: $$Y_t = \beta_0 + \beta_1 X_{t1} + \cdots + \beta_{p-1} X_{t,p-1} +\epsilon_t$$.

Remark For a second-order process, line 2 is replaced by 2’: $$\epsilon_t = \rho_1 \epsilon_{t-1} + \rho_2 \epsilon_{t-2} + u_t$$.

### properties of error terms

1. $E(\epsilon_t) = 0$
2. $\sigma^2(\epsilon_t) = \frac{\sigma^2}{1-\rho^2}$
3. $\sigma(\epsilon_t, \epsilon_{t-1}) = \rho \frac{\sigma^2}{1-\rho^2}$
4. $\sigma(\epsilon_t, \epsilon_{t-s}) = \rho^s \frac{\sigma^2}{1-\rho^2}$
5. $\rho(\epsilon_t, \epsilon_{t-1}) = \frac{\sigma(\epsilon_t, \epsilon_{t-1})}{\sigma(\epsilon_t) \sigma(\epsilon_{t-1})} = \rho$
6. $\rho(\epsilon_t, \epsilon_{t-s}) = \frac{\sigma(\epsilon_t, \epsilon_{t-s})}{\sigma(\epsilon_t) \sigma(\epsilon_{t-s})} = \rho^s$
7. $$\rho=0$$ implies that the error terms are uncorrelated.

Proof Note that $$\epsilon_t = \rho \epsilon_{t-1} + u_t = \rho^2 \epsilon_{t-2} + \rho u_{t-1} + u_t$$ $$= \rho^3 \epsilon_{t-3} + \rho^2 u_{t-2} + \rho u_{t-1} + u_t = \cdots =$$ $$\sum_{s=0}^\infty \rho^s u_{t-s}$$.

(2) Since the $$u_t$$’s are uncorrelated with each other and $$\sigma^2(u_t) = \sigma^2$$, $$\sigma^2(\epsilon_t) = \sum_{s=0}^\infty \rho^{2s} \sigma^2(u_{t-s}) = \sigma^2 \sum_{s=0}^\infty \rho^{2s} = \sigma^2 \frac{1}{1-\rho^2}$$ ($$|\rho|<1$$).

(3) Since $$E(\epsilon_t) = 0$$, $$\sigma(\epsilon_t, \epsilon_{t-1}) = E(\epsilon_t \epsilon_{t-1})$$ $$=E[(u_t + \rho u_{t-1} + \rho^2 u_{t-2} + \cdots )(u_{t-1} + \rho u_{t-2} + \rho^2 u_{t-3} + \cdots ) ]$$ $$=E[(u_t + \rho (u_{t-1} + \rho u_{t-2} + \cdots ))(u_{t-1} + \rho u_{t-2} + \rho^2 u_{t-3} + \cdots ) ]$$ $$= E[u_t (u_{t-1} + \rho u_{t-2} + \cdots )] + \rho E [ (u_{t-1} + \rho u_{t-2} + \cdots )^2 ]$$. By independence of the $$u_t$$’s, $$E(u_t u_{t-s}) = E(u_t) E(u_{t-s})= 0 ~\forall s \neq 0$$. $$\Rightarrow \sigma(\epsilon_t, \epsilon_{t-1}) = \rho E [ (u_{t-1} + \rho u_{t-2} + \cdots )^2 ] = \rho \sigma^2(\epsilon_{t-1}) = \rho \frac{\sigma^2}{1-\rho^2}$$.

The $$n \times n$$ variance-covariance matrix of the errors, $$\sigma^2(\epsilon)$$, is given as $$\begin{bmatrix} \kappa & \kappa \rho & \kappa \rho^2 & \cdots & \kappa \rho^{n-1} \\ \kappa \rho & \kappa & \kappa \rho & \cdots & \kappa \rho^{n-2} \\ \vdots & \vdots & \vdots & & \vdots \\ \kappa \rho^{n-1} & \kappa \rho^{n-2} & \kappa \rho^{n-3} & \cdots & \kappa \end{bmatrix}$$ where $$\kappa = \frac{\sigma^2}{1-\rho^2}$$.
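As a quick numerical check of the variance and covariance formulas above, one could simulate an AR(1) error sequence and compare its sample moments with the theoretical values. This is only a minimal sketch using NumPy; the function names `ar1_cov` and `simulate_ar1`, the burn-in length, and the parameter values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def ar1_cov(n, rho, sigma2):
    """n x n variance-covariance matrix of AR(1) errors: kappa * rho^|i-j|."""
    kappa = sigma2 / (1.0 - rho**2)
    idx = np.arange(n)
    return kappa * rho ** np.abs(idx[:, None] - idx[None, :])

def simulate_ar1(n, rho, sigma2, rng, burn_in=500):
    """Draw eps_t = rho * eps_{t-1} + u_t; the burn-in lets the start value be forgotten."""
    u = rng.normal(0.0, np.sqrt(sigma2), size=n + burn_in)
    eps = np.zeros(n + burn_in)
    for t in range(1, n + burn_in):
        eps[t] = rho * eps[t - 1] + u[t]
    return eps[burn_in:]

# Hypothetical parameter values chosen only for illustration.
rng = np.random.default_rng(0)
rho, sigma2 = 0.7, 1.0
eps = simulate_ar1(100_000, rho, sigma2, rng)

cov = ar1_cov(4, rho, sigma2)                          # theoretical matrix for n = 4
sample_var = eps.var()                                 # compare with sigma2 / (1 - rho^2)
sample_lag1 = np.corrcoef(eps[1:], eps[:-1])[0, 1]     # compare with rho
```

With a long simulated series, `sample_var` should be close to $$\frac{\sigma^2}{1-\rho^2}$$ and `sample_lag1` close to $$\rho$$.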

## 2. Durbin-Watson test for autocorrelation

Now, we want to determine whether the error terms of the regression model are autocorrelated. Note that $$\rho=0$$ implies $$\epsilon_t = u_t$$, so the error terms are uncorrelated.

$$H_0 : \rho=0$$, $$H_1 : \rho>0$$ (we consider positive autocorrelation.)

Test statistic

$D = \frac{\sum_{t=2}^n (e_t - e_{t-1})^2}{\sum_{t=1}^n e_t^2}$

where $$e_t = Y_t - \hat{Y_t}$$.

Exact critical values are difficult to obtain, but there are an upper bound $$d_u$$ and a lower bound $$d_l$$ such that a value of $$D$$ outside these bounds leads to a definite decision. Decision rule:

• If $$D > d_u$$, conclude $$H_0$$.
• If $$D < d_l$$, conclude $$H_1$$.
• If $$d_l < D < d_u$$, inconclusive.

Idea: Since the squared differences $$(\epsilon_t - \epsilon_{t-1})^2$$ tend to be small when the errors are positively autocorrelated, a small $$D$$ suggests $$\rho >0$$.
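The test statistic $$D$$ is straightforward to compute from the residuals of a fitted model. The following is a minimal sketch with NumPy, assuming a simple linear regression fitted by ordinary least squares; the helper names `durbin_watson` and `ols_residuals` and the simulated data are illustrative assumptions. The bounds $$d_l$$ and $$d_u$$ still have to be looked up in a Durbin-Watson table.

```python
import numpy as np

def durbin_watson(e):
    """D = sum_{t=2}^n (e_t - e_{t-1})^2 / sum_{t=1}^n e_t^2."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def ols_residuals(x, y):
    """Residuals e_t = Y_t - Y_hat_t of a simple linear regression."""
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ b

# Simulated example (hypothetical values): AR(1) errors with rho = 0.8
# versus independent errors, on the same predictor.
rng = np.random.default_rng(1)
n = 2000
x = np.linspace(0.0, 10.0, n)
u = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.8 * eps[t - 1] + u[t]

d_auto = durbin_watson(ols_residuals(x, 1.0 + 2.0 * x + eps))
d_indep = durbin_watson(ols_residuals(x, 1.0 + 2.0 * x + u))
```

For uncorrelated errors $$D$$ is typically near 2, while positive autocorrelation pushes it toward 0, which is what the comparison of `d_indep` and `d_auto` illustrates.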

## 3. Remedial measures for autocorrelation

Time-ordered effects are often due to missing key variables. When the long-term persistent effects cannot be captured by one or several predictor variables, a trend component can be added to the regression model, such as a linear/exponential trend, or indicator variables for seasonal effects.

### use of transformed variables

When adding predictor variables fails to cure the autocorrelation, we may consider transforming the variables to obtain an ordinary regression model. This can be done by setting $$Y_t^\prime = Y_t - \rho Y_{t-1}$$

$\Rightarrow Y_t^\prime = (\beta_0 + \beta_1 X_t + \epsilon_t) - \rho (\beta_0 + \beta_1 X_{t-1} + \epsilon_{t-1}) = \beta_0 (1 - \rho) + \beta_1 (X_t - \rho X_{t-1}) + u_t$

Set $$X_t^\prime = X_t - \rho X_{t-1}$$, $$\beta_0^\prime = \beta_0 (1 - \rho)$$, and $$\beta_1^\prime = \beta_1$$. $$\Rightarrow$$ $$Y_t^\prime = \beta_0^\prime + \beta_1^\prime X_t^\prime + u_t$$ where the $$u_t$$’s are independent and follow $$N(0,\sigma^2)$$.

To use the transformed model, we need to estimate $$\rho$$. Suppose that we have $$r$$, an estimator of $$\rho$$, and that fitting the transformed model gives $$b_0^\prime$$ and $$b_1^\prime$$. Then the fitted function in the original variables, $$\hat{Y} = b_0 + b_1 X$$, is recovered by $$b_0 = \frac{b_0^\prime}{1-r}$$ and $$b_1 = b_1^\prime$$, where $$s(b_0) = \frac{s(b_0^\prime)}{1-r}$$ and $$s(b_1) = s(b_1^\prime)$$.

There are three different procedures to obtain transformed models.

###### way 1: Cochrane-Orcutt procedure

It consists of three steps, which are iterated. The steps are:

1. Estimate $$\rho$$.
Consider a regression function through the origin: $$\epsilon_t = \rho \epsilon_{t-1} + u_t$$ where $$u_t$$ is a random disturbance term. Since $$\epsilon_t$$ and $$\epsilon_{t-1}$$ are unknown, we use $$e_t$$ and $$e_{t-1}$$, instead. The estimator of the slope $$r$$ is $$r = \frac{\sum_{t=2}^n e_{t-1}e_t}{\sum_{t=2}^n e_{t-1}^2}$$
2. Fit a transformed model. $$Y_t^\prime = \beta_0^\prime + \beta_1^\prime X_t^\prime + u_t$$
3. Test for autocorrelation with the Durbin-Watson test.

Note This approach does not always work properly. When the error terms are positively autocorrelated, the estimate $$r$$ tends to underestimate $$\rho$$.
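The estimation and fitting steps of the procedure can be sketched in a few lines of NumPy for the one-predictor case. This is a minimal sketch under stated assumptions: the function names, the fixed number of iterations `n_iter`, and the simulated data are illustrative and not from the original notes, and the Durbin-Watson check of step 3 is left out (a fixed iteration count is used instead).

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares fit; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

def cochrane_orcutt(x, y, n_iter=3):
    b0, b1 = fit_line(x, y)                 # start from the ordinary fit
    r = 0.0
    for _ in range(n_iter):
        e = y - (b0 + b1 * x)               # residuals in the original variables
        # Step 1: slope of the regression through the origin of e_t on e_{t-1}
        r = np.sum(e[:-1] * e[1:]) / np.sum(e[:-1] ** 2)
        # Step 2: fit the transformed model Y'_t = b0' + b1' X'_t
        y_p = y[1:] - r * y[:-1]
        x_p = x[1:] - r * x[:-1]
        b0_p, b1_p = fit_line(x_p, y_p)
        # Back-transform to the original variables
        b0, b1 = b0_p / (1.0 - r), b1_p
    return b0, b1, r

# Simulated data with true rho = 0.6 (hypothetical values).
rng = np.random.default_rng(2)
n, rho_true = 3000, 0.6
x = np.linspace(0.0, 10.0, n)
u = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = rho_true * eps[t - 1] + u[t]
y = 1.0 + 2.0 * x + eps

b0, b1, r = cochrane_orcutt(x, y)
```

In practice one would stop iterating as soon as the Durbin-Watson test on the transformed model no longer indicates autocorrelation.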

###### way 2: Hildreth-Lu procedure

This is analogous to the Box-Cox procedure. The value of $$\rho$$ is chosen to minimize the error sum of squares for the transformed regression model, $$SSE = \sum_t (Y_t^\prime - \hat{Y_t^\prime})^2 = \sum_t (Y_t^\prime - b_0^\prime - b_1^\prime X_t^\prime)^2$$. Then test for autocorrelation (with the Durbin-Watson test).

Note This approach does not need iterations, but it requires an optimization over $$\rho$$.
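One simple way to carry out the optimization is a grid search over candidate values of $$\rho$$. The sketch below assumes a one-predictor model and a grid on $$[0, 0.99]$$; the function names, the grid, and the simulated data are illustrative assumptions, and a numerical optimizer could be used instead of the grid.

```python
import numpy as np

def transformed_sse(rho, x, y):
    """SSE of the regression of Y'_t = Y_t - rho*Y_{t-1} on X'_t = X_t - rho*X_{t-1}."""
    y_p = y[1:] - rho * y[:-1]
    x_p = x[1:] - rho * x[:-1]
    X = np.column_stack([np.ones_like(x_p), x_p])
    b, *_ = np.linalg.lstsq(X, y_p, rcond=None)
    resid = y_p - X @ b
    return float(resid @ resid)

def hildreth_lu(x, y, grid=None):
    """Return the grid value of rho with the smallest SSE, plus the whole SSE curve."""
    if grid is None:
        grid = np.linspace(0.0, 0.99, 100)
    sse = np.array([transformed_sse(rho, x, y) for rho in grid])
    return grid[np.argmin(sse)], sse

# Simulated data with true rho = 0.6 (hypothetical values).
rng = np.random.default_rng(3)
n, rho_true = 3000, 0.6
x = np.linspace(0.0, 10.0, n)
u = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = rho_true * eps[t - 1] + u[t]
y = 1.0 + 2.0 * x + eps

rho_hat, sse_curve = hildreth_lu(x, y)
```

Plotting `sse_curve` against the grid shows how flat or sharp the minimum is.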

###### way 3: first difference procedure

Facts

• The autocorrelation parameter $$\rho$$ is frequently large (close to 1).
• $$SSE(\rho)$$ is often quite flat for large values of $$\rho$$ (when using the optimization approach of way 2, the Hildreth-Lu procedure).

Therefore, take $$\rho = 1$$. Then, $$\beta_0^\prime = \beta_0 (1-\rho) = 0$$ and the transformed model is $$Y_t^\prime = \beta_1^\prime X_t^\prime + u_t$$ where $$Y_t^\prime = Y_t - \rho Y_{t-1}$$, and $$X_t^\prime = X_t - \rho X_{t-1}$$. In the transformed model, the intercept term is canceled out.

The fitted regression line in the transformed variables is $$\hat{Y^\prime} = b_1^\prime X^\prime$$ which can be transformed back to the original variables as $$\hat{Y} = b_0 + b_1 X$$ where $$b_0 = \bar{Y} - b_1^\prime \bar{X}$$, and $$b_1 = b_1^\prime$$.

Note This approach is effective in a variety of applications and much simpler than the other two procedures, because $$\rho$$ does not have to be estimated.

## 4. forecasting with autocorrelated error terms

$$Y_t = \beta_0 + \beta_1 X_t + \epsilon_t$$ where $$\epsilon_t = \rho \epsilon_{t-1} + u_t$$, i.e. $$Y_t = \beta_0 + \beta_1 X_t + \rho \epsilon_{t-1} + u_t$$.

Thus, the forecast for the next period $$t+1$$, $$F_{t+1}$$, is constructed by handling the three components:

1. Given $$X_{t+1}$$, estimate $$\beta_0 + \beta_1 X_{t+1}$$ from the fitted regression function $$\hat{Y_{t+1}} = b_0 + b_1 X_{t+1}$$.
2. $$\rho$$ is estimated by $$r$$ and $$\epsilon_t$$ is estimated by the residual $$e_t = Y_t - (b_0 + b_1 X_t) = Y_t - \hat{Y_t}$$. Thus, $$\rho \epsilon_t$$ is estimated by $$re_t$$.
3. The disturbance term $$u_{t+1}$$ has expected value zero and is independent of earlier information. Therefore it is forecast by $$E(u_{t+1}) = 0$$.

$$\Rightarrow$$ The forecast for the period $$t+1$$ is $$F_{t+1} = \hat{Y_{t+1}} + re_t$$.

An approximate $$1-\alpha$$ prediction interval for $$Y_{t+1, new}$$ is obtained by employing the usual prediction limits for a new observation, but based on the transformed observations. The process is $$(X_i, Y_i) \rightarrow (X_i^\prime, Y_i^\prime) \rightarrow s^2(pred)$$

Therefore, an approximate $$1-\alpha$$ prediction interval for $$Y_{t+1, new}$$ is $$F_{t+1} \pm t(1-\alpha/2 ; n-3) s(pred)$$. Here, the degrees of freedom are $$n-3$$ because only $$n-1$$ transformed cases are used, and 2 degrees of freedom are lost for estimating the parameters $$\beta_0$$ and $$\beta_1$$.
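The point forecast itself is a one-line computation once $$b_0$$, $$b_1$$, and $$r$$ are available. A small worked sketch follows; the function name `forecast_next` and all numeric values are made up for illustration, and the prediction interval is not computed here.

```python
def forecast_next(b0, b1, r, x_last, y_last, x_next):
    """F_{t+1} = Y_hat_{t+1} + r * e_t, with e_t = Y_t - (b0 + b1 * X_t)."""
    e_last = y_last - (b0 + b1 * x_last)   # residual of the most recent period
    y_hat_next = b0 + b1 * x_next          # fitted regression function at X_{t+1}
    return y_hat_next + r * e_last

# Hypothetical numbers: b0 = 1, b1 = 2, r = 0.5, last case (X_t, Y_t) = (3, 8), X_{t+1} = 4.
# Then e_t = 8 - 7 = 1, Y_hat_{t+1} = 9, and F_{t+1} = 9 + 0.5 * 1 = 9.5.
forecast = forecast_next(1.0, 2.0, 0.5, 3.0, 8.0, 4.0)
```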

]]>
RA ch11 Remedial measures2020-07-10T11:12:00-04:002020-07-10T11:12:00-04:00https://hyemingu.github.io/blog/2020/regression_ch11Ch 11: Remedial measures

Transformation is one of the standard remedial measures for a linear model. Recall that its uses are:

1. to linearize the regression relationship
2. to make the error distribution more nearly normal
3. to make the variances of the error terms more nearly equal.

In this chapter, we introduce additional remedial measures to handle several pitfalls. Then we discuss nonparametric regression methods (which are quite different from the previous regression models). A common feature of these remedial measures and alternative regression methods is that their estimation procedures are relatively complex, so we need an easier and more generic way to evaluate the precision of these complex estimators. Bootstrapping is one example.

## Outline

1. Other remedial measures
• unequal error variance - weighted least squares
• multicollinearity - ridge regression
• influential cases - robust regression
2. Nonparametric regression
• regression trees
3. Bootstrap confidence intervals

## 1. Other remedial measures

### unequal error variance - weighted least squares

Generalized multiple regression model $$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} + \epsilon_i$$ where the $$\epsilon_i$$’s are independent, $$\epsilon_i \sim N(0, \sigma_i^2)$$, $$i = 1,2,\cdots, n$$. Here, the error variances may not be equal, so that $$\sigma^2(\epsilon)$$ is an $$n\times n$$ diagonal matrix whose $$i$$th diagonal entry is $$\sigma_i^2$$.

When we use the ordinary least squares estimators, we still have these properties:

1. the estimators of $$\beta$$ are unbiased
2. the estimators are consistent. Consistency means that $$\forall \varepsilon >0$$, $$P(|b_n - \beta| \leq \varepsilon) \rightarrow 1$$ as $$n \rightarrow \infty$$ ($$b_n$$ converges to $$\beta$$ in probability).

However, the estimators no longer have minimum variance. This is due to the unequal error variances, which make the $$n$$ cases no longer equally reliable. In other words, observations with small variances provide more reliable information.

A remedy to handle this problem is to use weighted least squares. First, we start from the simplest case.

when error variances are known Suppose that the error variances $$\sigma_i^2$$ for all $$n$$ observations are given. Then we use the maximum likelihood method again. First, we define the likelihood function $$L(\beta)$$.

$L(\beta) = \prod_{i=1}^n \frac{1}{(2\pi \sigma_i^2)^\frac{1}{2}} exp [ - \frac{1}{2 \sigma_i^2} (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1})^2 ]$

Here, we substitute $$w_i = \frac{1}{\sigma_i^2}$$ and rewrite the formula.

$L(\beta) = \prod_{i=1}^n \left( \frac{w_i}{2\pi}\right)^\frac{1}{2} exp [ - \frac{w_i}{2} (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1})^2 ]$

Since the $$\sigma_i^2$$’s are known, the only parameters of the likelihood function are the regression coefficients. The maximum likelihood method finds the coefficients that maximize the likelihood function.

This is the same as minimizing $$Q_w = \sum_{i=1}^n w_i (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1})^2$$, because $$\log L(\beta)$$ equals a constant minus $$\frac{1}{2} Q_w$$. The normal equation can be written using the weight matrix $$W = diag(w_1, w_2, \cdots, w_n)$$ as below:

$X^t WX b_w = X^tW Y$

$$\Rightarrow b_w = (X^t WX)^{-1} X^tW Y$$ where $$\sigma^2(b_w) = (X^t WX)^{-1}$$. This is called the weighted least squares estimator.

when variances are known up to proportionality constant Now, relaxing the condition, assume that we only know the ratios among the error variances. For example, the second error variance may be twice as large as the first one, i.e. $$\sigma_2^2 = 2\sigma_1^2$$. Then we can write $$\sigma_i^2 = k_i \sigma^2$$, i.e. $$w_i = \frac{1}{k_i\sigma^2}$$, where the $$k_i$$’s are known and $$\sigma^2$$ is an unknown constant (the true error variance of $$\epsilon_1$$ when $$k_1 = 1$$). Since the common factor $$\frac{1}{\sigma^2}$$ cancels in the normal equation, $$b_w$$ can be computed with the known weights $$\frac{1}{k_i}$$. Then, the $$p \times p$$ matrix $$\sigma^2 (b_w) = \sigma^2 (X^t K X)^{-1}$$, where $$K = diag(\frac{1}{k_1}, \frac{1}{k_2}, \cdots, \frac{1}{k_n})$$. Since $$\sigma^2$$ is unknown, we use $$s^2(b_w) = MSE_w (X^t K X)^{-1}$$ instead, where $$MSE_w = \frac{\sum_{i=1}^n \frac{1}{k_i} (Y_i - \hat{Y_i})^2}{n-p} = \frac{\sum_{i=1}^n \frac{1}{k_i} e_i^2}{n-p}$$.

when error variances are unknown We use estimators of the error variances. There are two approaches.

1. Estimation by residuals Since $$E(\epsilon_i) = 0$$, $$\sigma_i^2 = E(\epsilon_i^2) - [E(\epsilon_i)]^2 = E(\epsilon_i^2)$$. Hence, the squared residual $$e_i^2$$ is an estimator of $$\sigma_i^2$$.
2. Use of replicates or near replicates
• In a designed experiment (such as scientific simulations), replicate observations can be made. If the number of replicates is large, the sample variance of the replicates is close to the error variance and can be used as its estimate.
• However, in observational studies (such as social studies), we use nearby observations as near replicates.

Inference procedures need extra care when the weights are estimated. We want to estimate the variance-covariance matrix $$\sigma^2(b_w)$$. The confidence intervals for the regression coefficients are estimated with $$s^2(b_{w_k})$$, where $$k$$ stands for a bootstrap index, by applying the bootstrap method.

Lastly, we can use ordinary least squares and leave the error variances unequal. The regression coefficient estimator $$b$$ is still unbiased and consistent, but it is no longer a minimum variance estimator. Its variance is given as $$\sigma^2(b) = (X^tX)^{-1} X^t \sigma^2(\epsilon)X (X^tX)^{-1}$$ and it is estimated by $$s^2(b) = (X^tX)^{-1} X^t S_0 X (X^tX)^{-1}$$ where $$S_0 = diag(e_1^2, e_2^2, \cdots, e_n^2)$$.

Remark on weighted least squares Weighted least squares can be interpreted as transforming the data $$(X_i, Y_i, \epsilon_i)$$ by $$W^{\frac{1}{2}}$$. This can be written as

$W^{\frac{1}{2}}Y = W^{\frac{1}{2}}X\beta + W^{\frac{1}{2}}\epsilon$

and by setting $$Y_w = W^{\frac{1}{2}} Y$$, $$X_w = W^{\frac{1}{2}} X$$, and $$\epsilon_w = W^{\frac{1}{2}} \epsilon$$, the transformed variables follow the ordinary linear regression model

$Y_w = X_w \beta + \epsilon_w$

where $$E(\epsilon_w) = W^{\frac{1}{2}} E(\epsilon) = W^{\frac{1}{2}} 0 = 0$$ and $$\sigma^2(\epsilon_w) = W^{\frac{1}{2}} \sigma^2(\epsilon) W^{\frac{1}{2}} = W^{\frac{1}{2}} W^{-1} W^{\frac{1}{2}} = I$$. The regression coefficients are given as $$b_w = (X_w^t X_w)^{-1} X_w^t Y_w$$.
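The equivalence between the weighted least squares estimator and ordinary least squares on the transformed data can be checked numerically. Below is a minimal NumPy sketch assuming known error variances; the function name `wls`, the variance pattern `sigma_i = 0.5 * x_i`, and the simulated data are illustrative assumptions.

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares: b_w = (X^t W X)^{-1} X^t W y with W = diag(w)."""
    XtW = X.T * w                        # same as X^t @ diag(w), without forming W
    XtWX = XtW @ X
    b_w = np.linalg.solve(XtWX, XtW @ y)
    cov_b = np.linalg.inv(XtWX)          # sigma^2(b_w) when w_i = 1 / sigma_i^2
    return b_w, cov_b

# Simulated heteroscedastic data (hypothetical): sigma_i grows with x_i.
rng = np.random.default_rng(4)
n = 200
x = rng.uniform(1.0, 10.0, size=n)
sigma_i = 0.5 * x
y = 3.0 + 2.0 * x + rng.normal(size=n) * sigma_i
X = np.column_stack([np.ones(n), x])
w = 1.0 / sigma_i**2

b_w, cov_b = wls(X, y, w)

# Same estimate from OLS on the data transformed by W^{1/2}.
sqrt_w = np.sqrt(w)
X_w = X * sqrt_w[:, None]
y_w = y * sqrt_w
b_transformed, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)
```

Here `b_w` and `b_transformed` agree up to rounding error, which is the content of the remark above.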

### multicollinearity - ridge regression

When multicollinearity is detected in a regression model, we can fix the model by

1. using centered data in polynomial regression models
2. dropping some redundant predictor variables
3. adding some observations that could break the multicollinearity
4. using principal components instead of the current variables $$X_k$$
5. ridge regression.

Ridge regression perturbs the estimated coefficient $$b_R$$ away from the unbiased estimator $$b$$, and in this way it can remedy the multicollinearity problem. In other words, our unbiased estimator $$b$$ is obtained from the observation $$X$$ with multicollinearity, while the slightly perturbed estimator $$b_R$$ can be viewed as obtained from a virtual perturbed observation $$X_R$$ without multicollinearity. When perturbing $$b_R$$, we restrict $$b_R$$ to have a small norm (length). The length restriction shrinks the variance of the sampling distribution of $$b_R$$. Although $$b_R$$ is no longer an unbiased estimator, a sampled $$b_R$$ is more likely to fall near $$E(b_R)$$ than a sampled $$b$$ is to fall near $$E(b)=\beta$$. The relationship is given as a figure.

ridge regression

To obtain $$b_R$$, we consider using correlation transformed $$X$$ observations. This process normalizes the variable $$X$$, adjusting the magnitude of $$SSE = (Y-X\beta)^t (Y-X\beta)$$. $$b_R$$ can be obtained by minimizing a modified sum of squared errors function $$Q(\beta) = (Y-X\beta)^t (Y-X\beta) + c \beta^t \beta$$ where $$c$$ is a given biasing constant and $$c \geq 0$$. The squared $$\ell_2$$ norm of $$\beta$$ is added to the function. This allows us to minimize both $$SSE$$ and the length of $$\beta$$ at the same time. The emphasis placed on minimizing the length of $$\beta$$ is adjusted by the choice of $$c$$: the larger the value of $$c$$, the more the minimization focuses on the length of $$\beta$$. The resulting normal equation is $$(X^t X + cI) b_R = X^t Y$$. Therefore, $$b_R = (X^tX + cI)^{-1} X^tY$$.
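The closed form $$b_R = (X^tX + cI)^{-1} X^tY$$ is easy to try out on nearly collinear data. The sketch below uses NumPy and the correlation transformation (centering and scaling so that $$X^tX$$ becomes the correlation matrix); the function names, the value $$c = 0.1$$, and the simulated data are illustrative assumptions.

```python
import numpy as np

def correlation_transform(A):
    """Center each column and scale it by sqrt(n-1) times its sample standard deviation."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    return (A - A.mean(axis=0)) / (A.std(axis=0, ddof=1) * np.sqrt(n - 1))

def ridge(X, y, c):
    """Solve the ridge normal equation (X^t X + c I) b_R = X^t y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + c * np.eye(p), X.T @ y)

# Two nearly collinear predictors (hypothetical data).
rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
y = 1.0 + x1 + x2 + rng.normal(size=n)

Xs = correlation_transform(np.column_stack([x1, x2]))
ys = correlation_transform(y)

b_ols = ridge(Xs, ys, 0.0)     # c = 0 reproduces ordinary least squares
b_ridge = ridge(Xs, ys, 0.1)   # a positive c shrinks the coefficient vector
```

Comparing `np.linalg.norm(b_ridge)` with `np.linalg.norm(b_ols)` shows the shrinkage in length; repeating the fit over a grid of $$c$$ values gives the coefficient paths of a ridge trace.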

Choice of biasing constant $$c$$ As the biasing constant $$c$$ increases, the bias of the model increases. But there exists a value of $$c$$ for which the ridge estimator $$b_R$$ has a smaller total mean squared error $$E(b_R - \beta)^2$$ than the ordinary LSE $$b$$. However, the optimum value of $$c$$ varies from one application to another and is unknown. Common heuristics to determine an appropriate $$c$$ are:

1. using Ridge trace Try several $$c$$ and plot the $$b_R$$’s against values of $$0 \leq c \leq 1$$. Take the point $$c^\ast$$ where the fluctuation in $$b_R$$ ceases.
ridge trace
2. using $$VIF_k$$ Try several $$c$$ and plot the $$VIF_k$$’s against values of $$0 \leq c \leq 1$$. Take the point $$c^\ast$$ where the $$VIF_k$$’s have become small and stable.

### influential cases - robust regression

When an outlying influential case is not clearly erroneous, we should examine it further to obtain important clues for improving the model. We may find

• the omission of an important predictor variable,
• incorrect functional forms, or
• a need to dampen the influence of such cases.

Robust regression is an approach to regression that makes the results rely less on outlying cases. There are several ways to conduct robust regression:

1. LAR (= LAD = minimum $$\ell_1$$-error) regression Least Absolute Residuals (= Least Absolute Deviations = minimum ell-one-error regression) is done by minimizing $$Q_1(\beta) = \sum_{i=1}^n | Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1} |$$. When an outlier has a large residual ($$|e_i| > 1$$), its squared residual is much larger still, so a least squares fit gives this observation too much weight. This can be alleviated by using absolute values of residuals. Note that the absolute value function is not differentiable at zero. Thus, we cannot obtain an estimator by setting partial derivatives to zero. Instead, this problem can be solved by linear programming (a kind of optimization technique).
2. IRLS robust regression Iteratively Reweighted Least Squares. This approach uses the weighted least squares procedure introduced before, but the weights are now determined by how far outlying a case is: the more outlying the case, the smaller its weight. The weights are updated iteratively until the fit stabilizes.
3. LMS regression Least Median of Squares. Recall that the average is sensitive to outliers, but the median is not. Thus, we could minimize the function $$median (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1})^2$$.
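To make the IRLS idea concrete, here is a minimal NumPy sketch. The notes do not specify a weight function, so the sketch assumes the Huber weight function with tuning constant 1.345 and a scale estimate based on the median absolute deviation (MAD); these choices, the function name `irls_huber`, and the data are assumptions made for illustration.

```python
import numpy as np

def irls_huber(X, y, k=1.345, n_iter=50):
    """Iteratively reweighted least squares with Huber weights (assumed choice)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)        # start from ordinary least squares
    w = np.ones(len(y))
    for _ in range(n_iter):
        e = y - X @ b
        # Robust scale estimate: MAD / 0.6745
        scale = np.median(np.abs(e - np.median(e))) / 0.6745
        u = np.abs(e) / max(scale, 1e-12)            # scaled absolute residuals
        # Weight 1 for small residuals, k / u for large ones
        w = np.where(u <= k, 1.0, k / np.maximum(u, 1e-12))
        XtW = X.T * w
        b = np.linalg.solve(XtW @ X, XtW @ y)        # weighted least squares step
    return b, w

# Line with true slope 2 and one gross outlier at the last point (hypothetical data).
rng = np.random.default_rng(6)
x = np.arange(20.0)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=20)
y[-1] += 50.0
X = np.column_stack([np.ones_like(x), x])

b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
b_rob, w = irls_huber(X, y)
```

The outlier ends up with a weight close to zero, so `b_rob` stays near the true line while `b_ols` is pulled toward the outlier.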

## 2. Nonparametric regression

For complex cases, it is hard to guess the right functional form (analytic expression) of the regression function. Also, as the number of parameters increases, the space of observations gets rarefied (there are fewer cases in a neighborhood), so that the regression function tends to behave erratically. For these cases, nonparametric regression could be helpful.

### regression trees

A regression tree is a powerful, yet computationally simple method of nonparametric regression. It requires virtually no assumptions, and it is simple to interpret. It is popular for explanatory studies, especially for large data sets.

regression tree

In each branch of a regression tree, the range of a specific predictor variable $$X_k$$ is partitioned into two segments (one for $$X_k \leq C$$ and the other for $$X_k > C$$). In the terminal nodes of the tree, we evaluate the estimated regression fit by taking the mean value of the response in that segment, i.e. $$\hat{Y_i} = \frac{1}{n_s} \sum_{j=1}^{n_s} Y_j$$ where the $$Y_j$$’s belong to the segment that contains the $$i$$th observation and $$n_s$$ is the number of observations in the segment.

How to find a “best” regression tree is linked with

• the number of segmented regions $$r$$, and
• the split points $$C$$ on the variables $$X_k$$ that optimally divide the data into two sets at each branch.

The basic idea is to choose a model which minimizes the error sum of squares for the resulting regression tree, $$SSE = \sum_{t_i \in T} SSE(t_i)$$, where $$T$$ is the set of all terminal nodes, and $$SSE(t_i) = \sum_{j \in t_i} (Y_j - \overline{Y_{t_i}})^2$$ sums over the observations in node $$t_i$$. Here, $$MSE = SSE/(n-r)$$ and $$R^2 = 1- \frac{SSE}{SSTO}$$. The number of candidate models is $$r(p-1)$$ where $$1 \leq r \leq n$$. The value of $$r$$ is usually chosen through validation studies as $$argmin_r{MSPR}$$.
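A single branch of a regression tree can be sketched as an exhaustive search for the split point that minimizes the total $$SSE$$ of the two resulting segments. The sketch below handles one predictor only; the function names `best_split` and `fit_stump` and the toy data are illustrative assumptions, and a full tree would apply the same search recursively to each segment.

```python
import numpy as np

def best_split(x, y):
    """Return the split point C and the total SSE of the two segments it creates."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_c, best_sse = None, np.inf
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                                  # no split between equal x values
        left, right = ys[:i], ys[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_c, best_sse = (xs[i - 1] + xs[i]) / 2.0, sse
    return best_c, best_sse

def fit_stump(x, y):
    """One split: the fitted value in each segment is the mean response there."""
    c, sse = best_split(x, y)
    return c, y[x <= c].mean(), y[x > c].mean(), sse

# Toy data with an obvious jump between x = 3 and x = 10.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
c, left_mean, right_mean, sse = fit_stump(x, y)   # c = 6.5, means 1.0 and 5.0, SSE = 0.0
```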

## 3. Bootstrap confidence intervals

The regression approaches above are complex, so it is hard to estimate confidence intervals using the previous analysis of $$\sigma^2(b)$$ or $$s^2(b)$$. Instead, we can approximate confidence intervals by the bootstrap method. There are many different procedures to obtain bootstrap confidence intervals. One example is the reflection method.

A $$1-\alpha$$ confidence interval of a parameter $$\beta$$ is $$b - (b^\ast(1-\alpha/2) - b) \leq \beta \leq b + (b - b^\ast(\alpha/2))$$ where $$b^\ast(1-\alpha/2)$$ and $$b^\ast(\alpha/2)$$ are the corresponding percentiles of the bootstrap distribution of $$b^\ast$$.
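The reflection interval can be sketched for the slope of a simple linear regression. The notes do not say how the bootstrap samples are drawn, so this sketch assumes resampling of the cases $$(X_i, Y_i)$$ with replacement; the function names, the number of bootstrap samples, and the simulated data are illustrative assumptions.

```python
import numpy as np

def slope(x, y):
    """Least squares slope of a simple linear regression."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

def reflection_ci(x, y, alpha=0.05, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    b = slope(x, y)
    b_star = np.empty(n_boot)
    for k in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample cases with replacement
        b_star[k] = slope(x[idx], y[idx])
    # Percentiles b*(alpha/2) and b*(1 - alpha/2) of the bootstrap distribution
    lo_q, hi_q = np.quantile(b_star, [alpha / 2, 1 - alpha / 2])
    return b - (hi_q - b), b + (b - lo_q)

# Simulated data (hypothetical): true slope 2.
rng = np.random.default_rng(7)
x = rng.uniform(0.0, 10.0, size=60)
y = 1.0 + 2.0 * x + rng.normal(size=60)

b_hat = slope(x, y)
lower, upper = reflection_ci(x, y)
```

The same function works for any estimator: replace `slope` with the procedure whose precision is of interest, for example a ridge or robust fit.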

]]>
RA ch10 Diagnostics2020-07-09T11:12:00-04:002020-07-09T11:12:00-04:00https://hyemingu.github.io/blog/2020/regression_ch10Ch 10: Diagnostics

Once we’ve established a model, the next step is to try to improve it. If the model runs into trouble, we should diagnose where the trouble comes from. In a linear regression model, the possible causes include

• improper functional form of a predictor variable
• outliers
• influential observations
• multicollinearity.

Here are some suggestions for detecting such factors.

## Outline

1. Model adequacy for a predictor variable
2. Outliers and influential cases
• definitions
• identifying high leverage observations
• identifying outliers
• identifying influential cases
3. Multicollinearity diagnostic

## 1. Model adequacy for a predictor variable

We draw added-variable plots to detect whether a predictor variable has an improper functional form. For a regression model with 2 predictor variables $$X_1$$ and $$X_2$$, we want to find the regression effect for $$X_1$$ given that $$X_2$$ is already in the model. Here are the steps.

1. Regress $$Y$$ on $$X_2$$ and obtain the fitted values $$\hat{Y_i}(X_2) = b_0 + b_1 X_{i2}$$ and the residuals $$e_i(Y \mid X_2) = Y_i - \hat{Y_i}(X_2)$$.
2. Regress $$X_1$$ (the variable being tested) on $$X_2$$ and obtain the fitted values $$\hat{X_{i1}}(X_2) = b_0^\ast + b_1^\ast X_{i2}$$ and the residuals $$e_i(X_1 \mid X_2) = X_{i1} - \hat{X_{i1}}(X_2)$$.
3. Draw an added-variable plot (X-axis: $$e_i(X_1 \mid X_2)$$, Y-axis: $$e_i(Y \mid X_2)$$) and read it as one of three cases:
a. The points form a horizontal band: no additional information exists from adding $$X_1$$.
b. The points form a linear band with nonzero slope: there is additional linear information from adding $$X_1$$.
c. The points form a curved band: there is additional nonlinear information from adding $$X_1$$.
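The two sets of residuals needed for the plot can be computed as follows. The sketch also checks a known property of added-variable plots: the least squares slope through the plotted points equals the coefficient of $$X_1$$ in the full regression of $$Y$$ on $$X_1$$ and $$X_2$$. The function name `residuals_on` and the simulated data are illustrative assumptions; the plotting call itself is omitted.

```python
import numpy as np

def residuals_on(z, target):
    """Residuals from regressing `target` on `z` (with an intercept)."""
    Z = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    return target - Z @ coef

# Simulated data (hypothetical): X1 is correlated with X2, Y depends linearly on both.
rng = np.random.default_rng(8)
n = 200
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

e_y = residuals_on(x2, y)      # e(Y | X2): Y-axis of the added-variable plot
e_x1 = residuals_on(x2, x1)    # e(X1 | X2): X-axis of the added-variable plot

# Slope of the line through the origin fitted to the plotted points
slope_av = float(e_x1 @ e_y / (e_x1 @ e_x1))

# Coefficient of X1 in the full model, for comparison
X_full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
```

Plotting `e_y` against `e_x1` with any plotting library then gives the added-variable plot for $$X_1$$.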


## 2. Outliers and influential cases

### definitions

• An outlier is an observation whose response $$Y$$ is far from the rest of the observations.
• An observation has high leverage if it has “extreme” predictor $$X$$ values which is distant from other $$X$$ values. ex) one-variable case: An extreme $$X$$ value is simply one that is particularly high or low. ex) multiple-variables case: Extreme $$X$$ values may be particularly high or low for one or more predictors, or may be “unusual” combinations of predictor values.
• A data point is influential if it influences excessively in making a regression model. It can influence in the predicted responses, the estimated slope coefficients, or the hypothesis test results. Outliers and high leverage observations are likely to be influential.

For a model with a single predictor variable, we can simply look at scatter plots to identify any outliers and high leverage observations.

What if we have a model with multiple predictor variables? We can identify such observations from the diagonal entries of the hat matrix $$H = X(X^tX)^{-1}X^t$$.

• We call $$h_{ii}$$ the leverage value of the $$i$$th case. Leverage values can be calculated by $$h_{ii} = X_i^t (X^tX)^{-1}X_i$$, where $$X_i$$ is the $$i$$th row of $$X$$ written as a column vector.

### identifying high leverage observations

$$h_{ii}$$ has useful properties such as

1. $0 \leq h_{ii} \leq 1$
2. $\sum_i h_{ii} = rank(H) = p$
3. $Var(e_i) = (1-h_{ii})\sigma^2$

By property 2, (mean of $$h_{ii}$$) = $$\frac{p}{n}$$. Also, by property 3, if $$h_{ii}$$ is large, then the residual $$e_i$$ has small variance, so the fitted value $$\hat{Y_i}$$ stays close to $$Y_i$$. In other words, the observed response $$Y_i$$ plays a large role in the value of its own predicted response $$\hat{Y_i}$$. $$\Rightarrow$$ if $$h_{ii}$$ is more than twice the average ($$h_{ii}>\frac{2p}{n}$$), then such a case is a high leverage observation.
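
As a rough sketch, the leverage values and the $$\frac{2p}{n}$$ rule can be computed as follows. The data are synthetic and the planted extreme point in row 0 is an assumption made only to show the rule firing; here $$p$$ counts the intercept column as well.

```python
import numpy as np

# Synthetic design matrix: intercept plus two predictors (illustration only).
rng = np.random.default_rng(1)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
X[0, 1:] = [6.0, -6.0]               # planted unusual combination of predictor values

# Hat matrix H = X (X^t X)^{-1} X^t and its diagonal, the leverage values h_ii.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

print(h.sum())                       # property 2: equals p up to rounding
print(h.min(), h.max())              # property 1: all values lie in [0, 1]

# Rule of thumb: flag cases whose leverage exceeds twice the average p / n.
high_leverage = np.where(h > 2 * p / n)[0]
print(high_leverage)
```

Forming the full $$n \times n$$ matrix is fine for small data sets; for larger ones only the diagonal is needed.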

### identifying outliers

Recall that we have used the studentized residual $$r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE(1-h_{ii})}}$$, which approximately follows $$t(n-p)$$, to identify unusual $$Y$$ values. But this approach carries a risk: a potential outlier may pull the estimated regression function toward itself, so that it is not flagged as an outlier by the studentized residual criterion.

To address this issue, we use an alternative criterion based on studentized deleted residuals. We delete the $$i$$th observation and fit a regression model on the remaining $$n-1$$ observations, which gives the fitted value $$\hat{Y_{i(i)}}$$ for the deleted case. Then, we compare the observed response $$Y_i$$ with this fitted value. This produces the deleted residual $$d_i = Y_i - \hat{Y_{i(i)}}$$. Standardizing the deleted residuals produces the studentized deleted residuals.

We can prove that $$d_i = Y_i - \hat{Y_{i(i)}} = \frac{e_i}{1-h_{ii}}$$, with estimated variance $$s^2(d_i) = MSE_{(i)}\left(1 + X_i^t(X_{(i)}^tX_{(i)})^{-1}X_i \right) = \frac{MSE_{(i)}}{1-h_{ii}}$$, where $$X_{(i)}$$ is the design matrix with the $$i$$th row deleted and $$MSE_{(i)}$$ is the mean squared error of the fit without the $$i$$th case. Moreover, we can obtain $$MSE_{(i)}$$ from $$MSE$$ by the formula $$(n-p)MSE = (n-p-1)MSE_{(i)} + \frac{e_i^2}{1-h_{ii}}$$. Therefore, $$MSE_{(i)} = \frac{1}{n-p-1} \left((n-p)MSE - \frac{e_i^2}{1-h_{ii}} \right) = MSE + \frac{1}{n-p-1} \left(MSE - \frac{e_i^2}{1-h_{ii}}\right)$$.

Thanks to these identities, we don’t need to fit $$n$$ distinct regression lines. We test with the studentized deleted residual $$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}(1-h_{ii})}} \sim t(n-p-1)$$. If $$\mid t_i \mid$$ is large, we can conclude that the observation $$Y_i$$ is an outlier.
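
The shortcut can be checked numerically. The sketch below, on synthetic data with one planted outlier (case 3, an assumption for illustration), computes all studentized deleted residuals from a single fit and compares one of them with an explicit leave-one-out refit.

```python
import numpy as np

# Synthetic data with one planted outlier (illustration only).
rng = np.random.default_rng(2)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
y[3] += 5.0                                   # shift case 3 away from the model

# One fit on the full data.
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b                                 # ordinary residuals e_i
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverage values h_ii
mse = e @ e / (n - p)

# Shortcut formulas: no refitting needed.
d = e / (1 - h)                                            # deleted residuals d_i
mse_del = ((n - p) * mse - e**2 / (1 - h)) / (n - p - 1)   # MSE_(i)
t = d / np.sqrt(mse_del / (1 - h))                         # studentized deleted residuals t_i

# Brute-force check for one case: actually delete it and refit.
i = 3
keep = np.arange(n) != i
b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
d_refit = y[i] - X[i] @ b_i
res_i = y[keep] - X[keep] @ b_i
mse_refit = res_i @ res_i / (n - 1 - p)

print(d[i], d_refit)                          # agree up to rounding
print(mse_del[i], mse_refit)                  # agree up to rounding
print(np.argmax(np.abs(t)))                   # case with the largest |t_i|
```

Each $$\mid t_i \mid$$ can then be compared with a critical value of $$t(n-p-1)$$.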

### identifying influential cases

DFFITS : Influence of the observation $$Y_i$$ on the $$i$$th fitted value $$\hat{Y_i}$$

$DFFITS_i = \frac{\hat{Y_i} - \hat{Y_{i(i)}}}{\sqrt{MSE_{(i)} h_{ii}}} = t_i \left( \frac{h_{ii}}{1-h_{ii}} \right)^\frac{1}{2}$

If $$\mid DFFITS_i \mid$$ is larger than 1 (for small to medium sized datasets) or larger than $$2\sqrt{\frac{p}{n}}$$ (for large datasets), then the $$i$$th case is said to be influential.

Cook’s distance : Influence of the observation $$Y_i$$ on the fitted function $$\hat{Y}$$

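$D_i = \frac{\sum_{j=1}^n (\hat{Y_j} - \hat{Y_{j(i)}})^2}{p \cdot MSE} = \frac{e_i^2}{p \cdot MSE} \frac{h_{ii}}{(1-h_{ii})^2} = \frac{1}{p}r_i^2\frac{h_{ii}}{1-h_{ii}}$

where $$r_i = \frac{e_i}{\sqrt{MSE(1-h_{ii})}}$$ is the studentized residual. If $$D_i$$ falls at or above the 50th percentile of the $$F(p, n-p)$$ distribution, then the $$i$$th case is said to be influential.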
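
A small numerical sketch of DFFITS and Cook’s distance follows. The data are synthetic, with case 0 planted as both high leverage and off the line (an assumption for illustration). The closed forms are compared with the definitions computed from an explicit refit without case 0.

```python
import numpy as np

# Synthetic simple regression with one influential case (illustration only).
rng = np.random.default_rng(3)
n, p = 20, 2
x = rng.normal(size=n)
x[0] = 4.0                                    # extreme predictor value
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[0] -= 4.0                                   # and a response far from the line

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
y_hat = X @ b
e = y - y_hat
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverage values h_ii
mse = e @ e / (n - p)
mse_del = ((n - p) * mse - e**2 / (1 - h)) / (n - p - 1)

# Closed forms that need only the full fit.
t = e / np.sqrt(mse_del * (1 - h))            # studentized deleted residuals
dffits = t * np.sqrt(h / (1 - h))
cooks = e**2 / (p * mse) * h / (1 - h) ** 2

# Definitions evaluated by refitting without case i.
i = 0
keep = np.arange(n) != i
b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
dffits_refit = (y_hat[i] - X[i] @ b_i) / np.sqrt(mse_del[i] * h[i])
cooks_refit = np.sum((y_hat - X @ b_i) ** 2) / (p * mse)

print(dffits[i], dffits_refit)                # agree up to rounding
print(cooks[i], cooks_refit)                  # agree up to rounding
```

To turn $$D_i$$ into a percentile, one option is the $$F(p, n-p)$$ distribution function, for example `scipy.stats.f.cdf(cooks, p, n - p)` if SciPy is available.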

DFBETAS : Influence of the observation $$Y_i$$ on the regression coefficients

$DFBETAS_{k(i)} = \frac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)} C_{kk}}}, k=0,1,\cdots, p-1$

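where $$C_{kk}$$ is the $$k$$th diagonal element of $$(X^tX)^{-1}$$. Note that $$C_{kk} = \frac{\sigma^2(b_k)}{\sigma^2}$$. If $$\mid DFBETAS_{k(i)} \mid$$ is large, the $$i$$th case is said to be influential on $$b_k$$. A common guideline is a value above 1 for small to medium sized datasets and above $$\frac{2}{\sqrt{n}}$$ for large datasets.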
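
DFBETAS can also be obtained without refitting. The sketch below relies on the standard identity $$b - b_{(i)} = (X^tX)^{-1}X_i \frac{e_i}{1-h_{ii}}$$, which is not stated in the text above and is brought in here as an assumption. It runs on synthetic data and checks one case against an explicit refit.

```python
import numpy as np

# Synthetic data (illustration only).
rng = np.random.default_rng(4)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverage values h_ii
mse = e @ e / (n - p)
mse_del = ((n - p) * mse - e**2 / (1 - h)) / (n - p - 1)

# Column i of delta is b - b_(i), using the identity named in the lead-in.
delta = (XtX_inv @ X.T) * (e / (1 - h))       # shape (p, n)
c_kk = np.diag(XtX_inv)                       # diagonal elements C_kk
dfbetas = delta / np.sqrt(np.outer(c_kk, mse_del))

# Check case i by deleting it and refitting.
i = 0
keep = np.arange(n) != i
b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
dfbetas_refit = (b - b_i) / np.sqrt(mse_del[i] * c_kk)

print(dfbetas[:, i])
print(dfbetas_refit)                          # same values up to rounding
```

Row $$k$$ of `dfbetas` holds $$DFBETAS_{k(i)}$$ for every case $$i$$, so each coefficient can be scanned for large absolute values.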

## 3. Multicollinearity diagnostic

From a standardized regression model, we obtained the correlation matrix $$r_{XX} = x^tx$$ of the predictor variables. The variance-covariance matrix of the standardized regression coefficients is then $$\sigma^2(b^\ast) = (\sigma^\ast)^2 r_{XX}^{-1}$$.

This is the standardized version of $$\sigma^2(b) = \sigma^2 (X^tX)^{-1}$$.

Let us define the $$k$$th variance inflation factor $$VIF_k$$ as the $$k$$th diagonal element of $$r_{XX}^{-1}$$. We can find that $$\sigma^2(b_k^\ast) = (\sigma^\ast)^2 VIF_k$$.

$$VIF_k \geq 1$$, and we can diagnose multicollinearity among predictor variables by its magnitude.

• If $$VIF_k = 1$$, then $$X_k$$ is not linearly related to other predictor variables.
• If $$VIF_k \gg 1$$, then $$X_k$$ has inter-correlation with other predictor variables.
• If $$VIF_k = \infty$$, then $$X_k$$ is perfectly linearly associated with other predictor variables.

If $$VIF_k > 10$$, or if $$\overline{VIF}= \frac{\sum_{k=1}^{p-1} VIF_k}{p-1}$$ is considerably larger than 1, we can say that the $$k$$th predictor variable or the entire model, respectively, suffers from multicollinearity.
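
As an illustration, the variance inflation factors can be read off the inverse of the predictor correlation matrix. The sketch uses synthetic predictors in which `x2` is built to be nearly collinear with `x1` (an assumption for the example). It also cross-checks the first factor against the equivalent form $$VIF_k = \frac{1}{1-R_k^2}$$, where $$R_k^2$$ is the coefficient of determination from regressing $$X_k$$ on the other predictors. That form is a standard identity that does not appear in the text above.

```python
import numpy as np

# Synthetic predictors (illustration only): x2 is almost a copy of x1.
rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
P = np.column_stack([x1, x2, x3])

# VIF_k is the k-th diagonal element of the inverse correlation matrix.
r_xx = np.corrcoef(P, rowvar=False)
vif = np.diag(np.linalg.inv(r_xx))
print(vif, vif.mean())

# Cross-check VIF for x1 by regressing x1 on the other predictors.
Z = np.column_stack([np.ones(n), x2, x3])
coef, *_ = np.linalg.lstsq(Z, x1, rcond=None)
resid = x1 - Z @ coef
r2 = 1 - resid @ resid / np.sum((x1 - x1.mean()) ** 2)
vif_check = 1 / (1 - r2)
print(vif[0], vif_check)                      # agree up to rounding
```

In this setup the factors for `x1` and `x2` come out far above 10 while the factor for `x3` stays close to 1, matching the interpretation given in the bullets above.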

]]>