
Simple Linear Regression and Correlation

Introduction to Linear Regression

  • $Y = \beta_0 + \beta_1 x$
  • With multiple explanatory variables, e.g. $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

The Simple Linear Regression Model

  • $Y = \beta_0 + \beta_1 x + \epsilon$
  • $E(\epsilon) = 0$
  • $\hat y = b_0 + b_1 x$
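
A minimal simulation sketch of this model, assuming NumPy is available; the parameter values, noise level, and sample size below are illustrative assumptions, not values from these notes:

```python
import numpy as np

# Draw one sample from Y = beta0 + beta1 * x + eps, with eps ~ N(0, sigma^2).
# beta0, beta1, sigma, and the x grid are illustrative assumptions only.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 1.0
x = np.linspace(0.0, 10.0, 30)
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
print(y[:5])
```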

Least Squares and the Fitted Model

The Method of Least Squares

  • SSE = sum of squares of errors
  • We shall minimize $\displaystyle SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2$
  • From differentiating it, we obtain
    • $\displaystyle {\partial (SSE) \over \partial b_0} = -2 \sum_{i=1}^n (y_i - b_0 - b_1 x_i)$ and
    • $\displaystyle {\partial (SSE) \over \partial b_1} = -2 \sum_{i=1}^n (y_i - b_0 - b_1 x_i) x_i$
  • Setting partial derivatives to $0$, we obtain $\displaystyle \begin{cases} nb_0 + b_1 \sum x_i = \sum y_i \\ b_0 \sum x_i + b_1 \sum x_i^2 = \sum x_i y_i \end{cases}$
    • $\displaystyle b_1 = {\sum (x_i - \bar x)(y_i - \bar y) \over \sum (x_i - \bar x)^2}$
    • and $b_0 = \bar y - b_1 \bar x$
  • In matrix form, $\hat \beta = (X^T X)^{-1} X^T y$, where the design matrix $X$ has a column of ones for the intercept (or only the $x_i$ column if the intercept is $0$); see the sketch after this list
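
A minimal Python sketch of the least-squares formulas above (both the $S_{xy}/S_{xx}$ form and the matrix form), assuming NumPy; the `x` and `y` arrays are made-up example data:

```python
import numpy as np

# Made-up example data (not from the original notes).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.8])

# Normal-equation solutions for the slope and intercept.
Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()

# Equivalent matrix form: a column of ones in X estimates the intercept.
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
print(b0, b1, beta_hat)
```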

Properties of the Least Squares Estimators

Let $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

  • Since the estimator
    • $\displaystyle B_1 = {\sum (x_i - \bar x)(Y_i - \bar Y) \over \sum(x_i - \bar x)^2} = {\sum (x_i - \bar x)Y_i \over \sum (x_i - \bar x)^2}$ (since $\sum (x_i - \bar x)\bar Y = 0$) is of the form $\sum c_i Y_i$, where
      • $\displaystyle c_i = {x_i - \bar x \over \sum (x_i - \bar x)^2}$
    • We may deduce that
      • $\displaystyle \mu_{B_1} = {\sum (x_i - \bar x) (\beta_0 + \beta_1 x_i) \over \sum (x_i - \bar x)^2} = \beta_1$
      • $\displaystyle \sigma_{B_1}^2 = {\sigma^2 \over \sum (x_i - \bar x)^2}$
    • And similarly for $B_0$,
      • $\mu_{B_0} = \beta_0$
      • $\displaystyle \sigma_{B_0}^2 = {\sum x_i^2 \over n \sum(x_i - \bar x)^2} \sigma^2$
  • An unbiased estimate of $\sigma^2$ is
    • $\displaystyle s^2 = {SSE \over n-2} = {S_{yy} - b_1 S_{xy} \over n-2}$
  • Note that $\displaystyle S_{xx} = \sum (x_i - \bar x)^2, S_{yy} = \sum (y_i - \bar y)^2, S_{xy} = \sum (x_i - \bar x)(y_i - \bar y)$
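
A minimal sketch checking that the two expressions for $s^2$ agree, assuming NumPy; the data are made up:

```python
import numpy as np

# Made-up example data (not from the original notes).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.8])
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()

s2_direct = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)   # SSE / (n - 2)
s2_shortcut = (Syy - b1 * Sxy) / (n - 2)                 # (Syy - b1*Sxy) / (n - 2)
print(s2_direct, s2_shortcut)                            # the two forms agree
```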

Confidence Interval for the Slope

  • A $100(1 - \alpha)\%$ confidence interval for $\beta_1$ is
    • $\displaystyle b_1 - t_{\alpha/2} {s \over \sqrt{S_{xx}}} < \beta_1 < b_1 + t_{\alpha/2} {s \over \sqrt{S_{xx}}}$, where the d.f. is $n-2$
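
A minimal sketch of this confidence interval, assuming NumPy and SciPy; the data and $\alpha = 0.05$ are made-up choices:

```python
import numpy as np
from scipy import stats

# Made-up example data and alpha (not from the original notes).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.8])
n, alpha = x.size, 0.05

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # t_{alpha/2} with n-2 d.f.
half_width = t_crit * s / np.sqrt(Sxx)
print(b1 - half_width, b1 + half_width)
```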

Hypothesis Testing on the Slope

  • One can test the null hypothesis $\beta_1 = \beta_{10}$ using the statistic $\displaystyle t = {b_1 - \beta_{10} \over s/ \sqrt{S_{xx}}}$
  • Note that the d.f. is $n-2$
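
A minimal sketch of this t-test, assuming NumPy and SciPy; the data and the null value $\beta_{10} = 0$ are made-up choices:

```python
import numpy as np
from scipy import stats

# Made-up example data; beta10 = 0 tests whether the slope is significant.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.8])
n, beta10 = x.size, 0.0

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

t = (b1 - beta10) / (s / np.sqrt(Sxx))
p_value = 2 * stats.t.sf(abs(t), df=n - 2)   # two-sided p-value with n-2 d.f.
print(t, p_value)
```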

Statistical Inference on the Intercept

  • A $100(1 - \alpha)\%$ confidence interval for $\beta_0$ is
    • $\displaystyle b_0 - t_{\alpha/2} {s \over \sqrt{n S_{xx}}} \sqrt{\sum x_i^2} < \beta_0 < b_0 + t_{\alpha/2} {s \over \sqrt{n S_{xx}}}\sqrt{\sum x_i^2} $
    • where d.f. is $n-2$
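
A minimal sketch of the confidence interval for $\beta_0$, assuming NumPy and SciPy; the data and $\alpha$ are made up:

```python
import numpy as np
from scipy import stats

# Made-up example data and alpha (not from the original notes).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.8])
n, alpha = x.size, 0.05

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
half_width = t_crit * s * np.sqrt(np.sum(x ** 2)) / np.sqrt(n * Sxx)
print(b0 - half_width, b0 + half_width)
```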

Use of $R^2$

  • For $SSE = \sum (y_i - \hat y_{i})^2$ and $SST = \sum (y_i - \bar y)^2$, the coefficient of determination is $R^2 = 1 - SSE / SST$
  • $R^2$ is the proportion of the variation in $y$ explained by the fitted line, so a fit is good when $R^2 \approx 1$ (see the sketch below)
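
A minimal sketch of $R^2$ computed from its definition, assuming NumPy; the data are made up:

```python
import numpy as np

# Made-up example data (not from the original notes).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.8])

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

SSE = np.sum((y - (b0 + b1 * x)) ** 2)   # residual (error) sum of squares
SST = np.sum((y - y.mean()) ** 2)        # total sum of squares
print(1 - SSE / SST)                     # R^2
```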

Prediction

  • A $100(1 - \alpha)\%$ confidence interval for the mean response $\mu_{Y | x_0}$ is
    • $\displaystyle \hat y_0 - t_{\alpha/2} s \sqrt{{1 \over n} + {(x_0-\bar x)^2 \over S_{xx}}} < \mu_{Y | x_0} < \hat y_0 + t_{\alpha/2} s \sqrt{{1 \over n} + {(x_0-\bar x)^2 \over S_{xx}}}$
    • where d.f. is $n-2$
  • A $100(1 - \alpha)\%$ prediction interval for a single response $y_0$ is
    • $\displaystyle \hat y_0 - t_{\alpha/2} s \sqrt{1 + {1 \over n} + {(x_0 - \bar x)^2 \over S_{xx}}} < y_0 < \hat y_0 + t_{\alpha/2} s \sqrt{1 + {1 \over n} + {(x_0 - \bar x)^2 \over S_{xx}}}$
    • where d.f. is $n-2$
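
A minimal sketch of both intervals at a new point $x_0$, assuming NumPy and SciPy; the data, $\alpha$, and $x_0$ are made-up choices. Note that the prediction interval is always wider because of the extra $1$ under the square root:

```python
import numpy as np
from scipy import stats

# Made-up example data; alpha and the new point x0 are illustrative assumptions.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.8])
n, alpha, x0 = x.size, 0.05, 3.5

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
y0_hat = b0 + b1 * x0

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci_half = t_crit * s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)      # mean response
pi_half = t_crit * s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)  # single response
print(y0_hat - ci_half, y0_hat + ci_half)   # confidence interval for mu_{Y|x0}
print(y0_hat - pi_half, y0_hat + pi_half)   # prediction interval for y0
```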
