
# Simple Linear Regression and Correlation

## Introduction to Linear Regression

• $Y = \beta_0 + \beta_1 x$
• For multiple variables, $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

## The Simple Linear Regression Model

• $Y = \beta_0 + \beta_1 x + \epsilon$
• $E(\epsilon) = 0$
• $\hat y = b_0 + b_1 x$

## Least Squares and the Fitted Model

### The Method of Least Squares

• SSE = sum of squares of errors
• We shall minimize $\displaystyle SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2$
• Differentiating with respect to $b_0$ and $b_1$, we obtain
• $\displaystyle {\partial (SSE) \over \partial b_0} = -2 \sum_{i=1}^n (y_i - b_0 - b_1 x_i)$ and
• $\displaystyle {\partial (SSE) \over \partial b_1} = -2 \sum_{i=1}^n (y_i - b_0 - b_1 x_i) x_i$
• Setting partial derivatives to $0$, we obtain $\displaystyle \begin{cases} nb_0 + b_1 \sum x_i = \sum y_i \\ b_0 \sum x_i + b_1 \sum x_i^2 = \sum x_i y_i \end{cases}$
• $\displaystyle b_1 = {\sum (x_i - \bar x)(y_i - \bar y) \over \sum (x_i - \bar x)^2}$
• and $b_0 = \bar y - b_1 \bar x$
• In matrix form, $\hat \beta = (X^T X)^{-1} X^T y$ (with a column of ones in $X$ for the intercept; if the intercept is fixed at $0$, $X$ is just the column of $x_i$)
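The closed-form solutions above can be checked with a short script; the data below is invented purely for illustration.

```python
# Least-squares estimates b1 = S_xy / S_xx and b0 = y_bar - b1 * x_bar.
# The data is made up for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

S_xx = sum((xi - x_bar) ** 2 for xi in x)                        # sum (x_i - x_bar)^2
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # sum (x_i - x_bar)(y_i - y_bar)

b1 = S_xy / S_xx          # slope
b0 = y_bar - b1 * x_bar   # intercept

print(round(b0, 4), round(b1, 4))  # 1.3 0.9
```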

## Properties of the Least Squares Estimators

Let $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

• Since the estimator
• $\displaystyle B_1 = {\sum (x_i - \bar x)(Y_i - \bar Y) \over \sum(x_i - \bar x)^2} = {\sum (x_i - \bar x)Y_i \over \sum (x_i - \bar x)^2}$ is in the form of $\sum c_i Y_i$, where
• $\displaystyle c_i = {x_i - \bar x \over \sum (x_i - \bar x)^2}$
• We may deduce that
• $\displaystyle \mu_{B_1} = {\sum (x_i - \bar x) (\beta_0 + \beta_1 x_i) \over \sum (x_i - \bar x)^2} = \beta_1$
• $\displaystyle \sigma_{B_1}^2 = {\sigma^2 \over \sum (x_i - \bar x)^2}$
• And similarly for $B_0$,
• $\mu_{B_0} = \beta_0$
• $\displaystyle \sigma_{B_0}^2 = {\sum x_i^2 \over n \sum(x_i - \bar x)^2} \sigma^2$
• An unbiased estimate of $\sigma^2$ is
• $\displaystyle s^2 = {SSE \over n-2} = {S_{yy} - b_1 S_{xy} \over n-2}$
• Note that $\displaystyle S_{xx} = \sum (x_i - \bar x)^2, S_{yy} = \sum (y_i - \bar y)^2, S_{xy} = \sum (x_i - \bar x)(y_i - \bar y)$
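The shortcut $SSE = S_{yy} - b_1 S_{xy}$ makes $s^2$ easy to compute by hand or in code; here is a minimal sketch with made-up data:

```python
import math

# s^2 = SSE / (n - 2) = (S_yy - b1 * S_xy) / (n - 2); data invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
S_xx = sum((xi - x_bar) ** 2 for xi in x)
S_yy = sum((yi - y_bar) ** 2 for yi in y)
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = S_xy / S_xx

s2 = (S_yy - b1 * S_xy) / (n - 2)  # unbiased estimate of sigma^2
s = math.sqrt(s2)                  # residual standard error
print(round(s2, 4), round(s, 4))   # 0.6333 0.7958
```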

### Confidence Interval for the Slope

• A $100(1 - \alpha)\%$ confidence interval for $\beta_1$ is
• $\displaystyle b_1 - t_{\alpha/2} {s \over \sqrt{S_{xx}}} < \beta_1 < b_1 + t_{\alpha/2} {s \over \sqrt{S_{xx}}}$ where d.f. is $n-2$
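A numerical sketch of this interval, using invented data and a critical value read from a t-table:

```python
import math

# 95% CI for the slope: b1 +/- t_{alpha/2} * s / sqrt(S_xx); data invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
S_xx = sum((xi - x_bar) ** 2 for xi in x)
S_yy = sum((yi - y_bar) ** 2 for yi in y)
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = S_xy / S_xx
s = math.sqrt((S_yy - b1 * S_xy) / (n - 2))

t_crit = 3.182  # t_{0.025} with n - 2 = 3 d.f., taken from a t-table
margin = t_crit * s / math.sqrt(S_xx)
print(round(b1 - margin, 4), round(b1 + margin, 4))  # 0.0992 1.7008
```

Since the interval excludes $0$, the slope is significant at the 5% level for this data.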

### Hypothesis Testing on the Slope

• One can do hypothesis testing based on $\displaystyle t = {b_1 - \beta_{10} \over s/ \sqrt{S_{xx}}}$, where $\beta_{10}$ is the hypothesized slope under the null hypothesis $H_0: \beta_1 = \beta_{10}$
• Note that d.f. is $n-2$
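For the common test $H_0: \beta_1 = 0$ (no linear relationship), the statistic works out as follows on made-up data:

```python
import math

# t statistic for H0: beta_1 = 0; data invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
S_xx = sum((xi - x_bar) ** 2 for xi in x)
S_yy = sum((yi - y_bar) ** 2 for yi in y)
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = S_xy / S_xx
s = math.sqrt((S_yy - b1 * S_xy) / (n - 2))

beta_10 = 0  # hypothesized slope under H0
t = (b1 - beta_10) / (s / math.sqrt(S_xx))
print(round(t, 3))  # 3.576 > t_{0.025,3} = 3.182, so reject H0 at the 5% level
```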

### Statistical Inference on the Intercept

• A $100 (1 - \alpha)\%$ confidence interval for $\beta_0$ is
• $\displaystyle b_0 - t_{\alpha/2} {s \over \sqrt{n S_{xx}}} \sqrt{\sum x_i^2} < \beta_0 < b_0 + t_{\alpha/2} {s \over \sqrt{n S_{xx}}}\sqrt{\sum x_i^2}$
• where d.f. is $n-2$
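The same recipe applies to the intercept; a sketch with the same kind of invented data:

```python
import math

# CI for the intercept: b0 +/- t_{alpha/2} * s * sqrt(sum x_i^2) / sqrt(n * S_xx).
x = [1, 2, 3, 4, 5]   # data invented for illustration
y = [2, 3, 5, 4, 6]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
S_xx = sum((xi - x_bar) ** 2 for xi in x)
S_yy = sum((yi - y_bar) ** 2 for yi in y)
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = S_xy / S_xx
b0 = y_bar - b1 * x_bar
s = math.sqrt((S_yy - b1 * S_xy) / (n - 2))

t_crit = 3.182  # t_{0.025} with 3 d.f., from a t-table
margin = t_crit * s * math.sqrt(sum(xi ** 2 for xi in x)) / math.sqrt(n * S_xx)
print(round(b0 - margin, 4), round(b0 + margin, 4))  # -1.3559 3.9559
```

Here the interval contains $0$, so the intercept is not significantly different from $0$ for this data.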

### Use of $R^2$

• For $SSE = \sum (y_i - \hat y_{i})^2$ and $SST = \sum (y_i - \bar y)^2$, $R^2 = 1 - SSE / SST$
• $R^2$ close to $1$ indicates the fitted line explains most of the variation in $y$; values near $0$ indicate it explains little
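Computing $R^2$ directly from the residuals, again with invented data:

```python
# R^2 = 1 - SSE / SST; data invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
S_xx = sum((xi - x_bar) ** 2 for xi in x)
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = S_xy / S_xx
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]                        # fitted values
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))     # residual sum of squares
SST = sum((yi - y_bar) ** 2 for yi in y)                  # total sum of squares
R2 = 1 - SSE / SST
print(round(R2, 2))  # 0.81
```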

## Prediction

• A $100 (1 - \alpha)\%$ confidence interval for the mean response $\mu_{Y | x_0}$ is
• $\displaystyle \hat y_0 - t_{\alpha/2} s \sqrt{{1 \over n} + {(x_0-\bar x)^2 \over S_{xx}}} < \mu_{Y | x_0} < \hat y_0 + t_{\alpha/2} s \sqrt{{1 \over n} + {(x_0-\bar x)^2 \over S_{xx}}}$
• where d.f. is $n-2$
• A $100(1 - \alpha)\%$ prediction interval for a single response $y_0$ is
• $\displaystyle \hat y_0 - t_{\alpha/2} s \sqrt{1 + {1 \over n} + {(x_0 - \bar x)^2 \over S_{xx}}} < y_0 < \hat y_0 + t_{\alpha/2} s \sqrt{1 + {1 \over n} + {(x_0 - \bar x)^2 \over S_{xx}}}$
• where d.f. is $n-2$
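The two intervals above differ only by the extra "$1 +$" under the square root, which accounts for the variance of a new observation; a sketch with invented data and a hypothetical $x_0$:

```python
import math

# Mean-response CI and prediction interval at x0; data invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
S_xx = sum((xi - x_bar) ** 2 for xi in x)
S_yy = sum((yi - y_bar) ** 2 for yi in y)
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = S_xy / S_xx
b0 = y_bar - b1 * x_bar
s = math.sqrt((S_yy - b1 * S_xy) / (n - 2))

x0 = 3.5                  # hypothetical new x value
y0_hat = b0 + b1 * x0     # point estimate at x0
t_crit = 3.182            # t_{0.025} with 3 d.f., from a t-table

# CI for the mean response mu_{Y|x0}
half_ci = t_crit * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / S_xx)
# Prediction interval for a single new y0: extra "1 +" for the new observation's variance
half_pi = t_crit * s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / S_xx)

print(round(y0_hat - half_ci, 3), round(y0_hat + half_ci, 3))  # 3.249 5.651
print(round(y0_hat - half_pi, 3), round(y0_hat + half_pi, 3))  # 1.647 7.253
```

As expected, the prediction interval is strictly wider than the mean-response interval at the same $x_0$.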
