Linear regression models assume that the output variable $Y$ can be explained (or reasonably approximated) as a linear combination of the input vector $X$:
where
The most common method to estimate $\beta$ from the training data is the least squares, where the coefficients that minimize the residual sum of squares
RSS has two crucial properties worth expliciting. First, squaring the differences ignores the direction of the differences, and hence negative differences cannot offset positive diferences. Second, the square assigns a higher importance to big errors. We can rewrite RSS in matrix form as
To minimize RSS, we differentiate this quadratic function with respect to $\beta$
and, assuming $\textbf X$ has full column rank and hence $\textbf X^T\textbf X$ is positive definite, we set the first derivative to zero
This allows to obtain a unique solution
If $\textbf X$ is not full rank i.e. its columns are not linearly independent, then $\textbf X^T \textbf X$ is singular, and the least squares coefficients $\hat \beta$ are not uniquely defined.
The data we use is no more than a sample of all the possible cases. Hence, a different training data will lead to a different set of $\hat \beta$. If we assume that the observations $y_i$ are uncorrelated and have constant variance $\sigma^2$, and that the $x_i$ are fixed (non-random), the variance in the $\hat \beta$ is
If we also assume that the a lineal model is correct for the mean, and that deviations of $y$ around its expectation are additive and Gaussian ($\varepsilon \tilde{} N(0,\sigma^2)$),
Regarding $\sigma^2$, we can use an unbiased estimator
which follows a $\chi^2$ distribution.
With these properties we can define confidence intervals for the parameters $\beta_j$. We do so by using a Z-score to test if each coefficient $\beta_j = 0$:
where $v_j$ is the $j$th diagonal element of $(\textbf X^T\textbf X)^{-1}$. $z_j$ is distributed as $t_{N-p-1}$.
Let’s focus on the estimation of any linear combination of the parameters $\theta=a^T \beta$ (e.g. the predictions $f(x_0)=x_0^T\beta$). The least squares estimate is
If the model is linear, we can prove that $a^T\hat \beta$ is an unbiased estimator i.e. the expected value does not differ from the true value of the parameter being estimated:
Not only that, but the Gauss-Markov theorem states that the parameters $\hat \beta$ are the best linear unbiased estimator (BLUE), as its variance is the smallest. Let’s dig a bit into this. The mean squared error (MSE) of an estimator $\hat \theta$ in estimating $\theta$ is
where $Var(\hat \theta)$ is the variance, and $[E(\hat \theta - \theta)]^2$ the squared bias. Hence, among unbiased estimators (those where $[E(\hat \theta - \theta)]^2 = 0$), least squares estimator is the best one, as its variance, and hence the prediction error, are the smallest.
There are two main reservations about least squares estimates. The first one is about prediction accuracy: despite least squared being the best among unbiased estimators, there are biased estimators with better prediction accuracy (lower MSE). They achieve so by accepting some bias to reduce variance e.g. constraining the size of the model, setting coefficients to 0 and hence achieving lower variance than least squares. The second one is interpretation: least squares takes all the input variables as predictors, which complicates the understanding of the model.
Some methods couple variable subset selection to the linear regression, discarding input variables:
However, those methods often exhibit high variance and do not reduce prediction error.
Ridge regression shrinks the regression coefficients by imposing a penalty on their size:
where $\lambda\geq0$ is the complexity parameter that controls the size of the model: the larger $\lambda$, the greater the shrinkage, the higher the tendency of the coefficients towards zero. Note that the intersection $\beta_0$ is left out of the penalty term. A more explicit way of displaying the size constraint is
subject to $\sum_{j=1}^p\beta_j^2\leq t$. Rewriting it in matrix form we can see the ridge regression solutions:
where $\textbf I$ is the $p \times p$ identity matrix.
Several regularization methods have been proposed for ill-posed problems i.e. those where one or more of the following properties don’t apply:
We look for an estimate a parameter vector beta $\hat{\beta}$ that maximizes the likelihood of the model. Iterative solutions follow this algorithm: