Linear regression models assume that the output variable $Y$ can be explained (or reasonably approximated) as a linear combination of the input vector $X$:
$$f(X) = \beta_0 + \sum_{j=1}^p X_j \beta_j$$

Least squares method

The most common method to estimate $\beta$ from the training data is least squares, which picks the coefficients that minimize the residual sum of squares
$$\mathrm{RSS}(\beta) = \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2$$
RSS has two crucial properties worth making explicit. First, squaring the differences ignores their direction, and hence negative differences cannot offset positive differences. Second, the square assigns a higher importance to big errors. We can rewrite RSS in matrix form as
$$\mathrm{RSS}(\beta) = (\textbf y - \textbf X \beta)^T (\textbf y - \textbf X \beta)$$
To minimize RSS, we differentiate this quadratic function with respect to $\beta$
$$\frac{\partial \mathrm{RSS}}{\partial \beta} = -2 \textbf X^T (\textbf y - \textbf X \beta)$$
and, assuming $\textbf X$ has full column rank and hence $\textbf X^T\textbf X$ is positive definite, we set the first derivative to zero
$$\textbf X^T (\textbf y - \textbf X \beta) = 0$$
This yields a unique solution
$$\hat \beta = (\textbf X^T \textbf X)^{-1} \textbf X^T \textbf y$$
If $\textbf X$ is not of full rank, i.e. its columns are not linearly independent, then $\textbf X^T \textbf X$ is singular and the least squares coefficients $\hat \beta$ are not uniquely defined.
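As an illustration, the closed-form solution can be computed directly with NumPy; the data, true coefficients, and seed below are made up for the example:

```python
import numpy as np

# Made-up toy data: N = 100 observations, p = 2 predictors plus an intercept.
rng = np.random.default_rng(0)
N, p = 100, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Closed-form least squares: beta_hat = (X^T X)^{-1} X^T y.
# np.linalg.solve avoids explicitly inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's dedicated least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

With little noise and plenty of observations, both routines recover coefficients close to `beta_true`.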

Least squares as BLUE

Variance of the estimates

The data we use is no more than a sample of all the possible cases. Hence, a different training set will lead to a different set of estimates $\hat \beta$. If we assume that the observations $y_i$ are uncorrelated and have constant variance $\sigma^2$, and that the $x_i$ are fixed (non-random), the variance of $\hat \beta$ is
$$Var(\hat \beta) = (\textbf X^T \textbf X)^{-1} \sigma^2$$
If we also assume that the linear model is correct for the mean, and that the deviations of $y$ around its expectation are additive and Gaussian ($\varepsilon \sim N(0,\sigma^2)$), then
$$\hat \beta \sim N\left(\beta, (\textbf X^T \textbf X)^{-1} \sigma^2\right)$$
Regarding $\sigma^2$, we can use an unbiased estimator
$$\hat \sigma^2 = \frac{1}{N - p - 1} \sum_{i=1}^N (y_i - \hat y_i)^2$$
which follows a scaled chi-squared distribution: $(N-p-1)\,\hat \sigma^2 \sim \sigma^2 \chi^2_{N-p-1}$.

With these properties we can define confidence intervals for the parameters $\beta_j$. We do so by using a Z-score to test the hypothesis that each coefficient $\beta_j = 0$:
$$z_j = \frac{\hat \beta_j}{\hat \sigma \sqrt{v_j}}$$
where $v_j$ is the $j$th diagonal element of $(\textbf X^T\textbf X)^{-1}$. Under the null hypothesis $\beta_j = 0$, $z_j$ is distributed as $t_{N-p-1}$.
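These quantities can be sketched in NumPy on simulated data where one coefficient is truly zero (all names, seeds, and values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([0.5, 2.0, 0.0, -1.5])  # one truly-zero coefficient
y = X @ beta_true + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Unbiased estimate of sigma^2 with N - p - 1 degrees of freedom
# (p predictors plus an intercept).
sigma2_hat = resid @ resid / (N - p - 1)

# v_j: diagonal of (X^T X)^{-1}; z_j = beta_hat_j / (sigma_hat * sqrt(v_j)).
v = np.diag(np.linalg.inv(X.T @ X))
z = beta_hat / np.sqrt(sigma2_hat * v)
print(z)
```

The z-scores of the truly non-zero coefficients come out large in magnitude, while the z-score of the zero coefficient stays near typical $t_{N-p-1}$ values.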

The Gauss-Markov theorem

Let’s focus on the estimation of any linear combination of the parameters $\theta=a^T \beta$ (e.g. the predictions $f(x_0)=x_0^T\beta$). The least squares estimate is
$$\hat \theta = a^T \hat \beta = a^T (\textbf X^T \textbf X)^{-1} \textbf X^T \textbf y$$
If the model is linear, we can prove that $a^T\hat \beta$ is an unbiased estimator, i.e. its expected value does not differ from the true value of the parameter being estimated:
$$E(a^T \hat \beta) = a^T (\textbf X^T \textbf X)^{-1} \textbf X^T E(\textbf y) = a^T (\textbf X^T \textbf X)^{-1} \textbf X^T \textbf X \beta = a^T \beta$$
Not only that, but the Gauss-Markov theorem states that the parameters $\hat \beta$ are the best linear unbiased estimator (BLUE), as its variance is the smallest. Let’s dig a bit into this. The mean squared error (MSE) of an estimator $\hat \theta$ in estimating $\theta$ is
$$MSE(\hat \theta) = E(\hat \theta - \theta)^2 = Var(\hat \theta) + [E(\hat \theta - \theta)]^2$$
where $Var(\hat \theta)$ is the variance and $[E(\hat \theta - \theta)]^2$ the squared bias. Hence, among unbiased estimators (those where $[E(\hat \theta - \theta)]^2 = 0$), the least squares estimator is the best one, as its variance, and hence its estimation error, is the smallest.
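The decomposition above can be checked numerically with a small Monte Carlo simulation, refitting least squares on repeated draws of $y$ over a fixed design $\textbf X$ (the setup below is a made-up example):

```python
import numpy as np

# Monte Carlo check of the bias-variance decomposition for the least
# squares estimate of theta = a^T beta, with X held fixed across draws.
rng = np.random.default_rng(4)
N, p = 30, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # fixed design
beta = np.array([1.0, -2.0, 0.5])
a = np.array([1.0, 1.0, 1.0])
theta = a @ beta  # true value of the linear combination

estimates = []
for _ in range(5000):
    y = X @ beta + rng.normal(size=N)  # new noise, same X
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    estimates.append(a @ beta_hat)
estimates = np.array(estimates)

mse = np.mean((estimates - theta) ** 2)
var = estimates.var()
bias2 = (estimates.mean() - theta) ** 2
print(mse, var + bias2)
```

Over the simulated draws, the sample MSE equals the sample variance plus the squared bias, and the average of the estimates sits very close to the true $\theta$, consistent with unbiasedness.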


There are two main reservations about least squares estimates. The first one concerns prediction accuracy: although least squares is the best among unbiased estimators, there are biased estimators with better prediction accuracy (lower MSE). They achieve this by accepting some bias in exchange for a reduction in variance, e.g. by constraining the size of the model or setting some coefficients to 0, and hence attaining lower variance than least squares. The second one concerns interpretation: least squares takes all the input variables as predictors, which complicates the understanding of the model.

Subset selection

Some methods couple variable subset selection to the linear regression, discarding input variables:

However, those methods often exhibit high variance and do not reduce prediction error.

Shrinkage methods

Ridge regression

Ridge regression shrinks the regression coefficients by imposing a penalty on their size:
$$\hat \beta^{ridge} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\}$$
where $\lambda\geq0$ is the complexity parameter that controls the amount of shrinkage: the larger $\lambda$, the greater the shrinkage and the more the coefficients are pulled towards zero. Note that the intercept $\beta_0$ is left out of the penalty term. A more explicit way of displaying the size constraint is
$$\hat \beta^{ridge} = \underset{\beta}{\arg\min} \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2$$
subject to $\sum_{j=1}^p\beta_j^2\leq t$. Rewriting it in matrix form we can see the ridge regression solutions:
$$\hat \beta^{ridge} = (\textbf X^T \textbf X + \lambda \textbf I)^{-1} \textbf X^T \textbf y$$
where $\textbf I$ is the $p \times p$ identity matrix.
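A minimal NumPy sketch of the ridge solution, on centered data so the unpenalized intercept drops out (the data and $\lambda$ values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 5
X = rng.normal(size=(N, p))
# Center X and y so the intercept can be handled separately and left unpenalized.
X = X - X.mean(axis=0)
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=0.5, size=N)
y = y - y.mean()

def ridge(X, y, lam):
    """Ridge solution: (X^T X + lambda I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_0 = ridge(X, y, 0.0)    # lambda = 0 recovers ordinary least squares
beta_10 = ridge(X, y, 10.0)  # larger lambda shrinks coefficients toward zero
print(np.linalg.norm(beta_0), np.linalg.norm(beta_10))
```

Increasing $\lambda$ monotonically shrinks the norm of the coefficient vector, while $\lambda = 0$ reproduces the least squares fit.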


Regularization methods

Several regularization methods have been proposed for ill-posed problems, i.e. those where one or more of the following properties do not hold:

  1. A solution exists.
  2. The solution is unique.
  3. The solution depends continuously on the data (small changes in the data produce small changes in the solution).
Fitting a logistic regression through an iterative solution

We look for an estimate $\hat{\beta}$ of the parameter vector $\beta$ that maximizes the likelihood of the model. Iterative solutions follow this algorithm:

  1. Select $\hat{\beta}_0$, an initial guess for $\hat{\beta}$.
  2. Get a new $\hat{\beta}$ using a quadratic approximation of the log-likelihood.
  3. Calculate $C = \lVert \hat{\beta} - \hat{\beta}_0 \rVert$.
  4. Set $\hat{\beta}_0 = \hat{\beta}$ and repeat from step 2 until $C < k$, where $k$ is a convergence criterion.
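The steps above can be sketched as a Newton-Raphson loop in NumPy (a minimal illustration; the simulated data, tolerance, and iteration cap are made-up assumptions):

```python
import numpy as np

def fit_logistic(X, y, tol=1e-8, max_iter=50):
    """Newton-Raphson fit of a logistic regression.

    Each step maximizes a quadratic approximation of the log-likelihood:
        beta_new = beta + (X^T W X)^{-1} X^T (y - p)
    where p holds the current fitted probabilities and W = diag(p * (1 - p)).
    """
    beta = np.zeros(X.shape[1])                # step 1: initial guess
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # fitted probabilities
        W = p * (1.0 - p)
        # step 2: Newton update from the quadratic approximation
        beta_new = beta + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
        # steps 3-4: stop once the change in beta is below the criterion
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

# Simulated binary outcomes from a known logistic model.
rng = np.random.default_rng(3)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-0.5, 1.5])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta_hat = fit_logistic(X, y)
print(beta_hat)
```

With a sample of this size the fitted coefficients land close to the generating values, and the loop typically converges in well under the iteration cap thanks to Newton's quadratic convergence.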