# Cross-validation

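We split the data into two parts: we train the model on one of them (the training set) and test it on the other (the testing set). In $k$-fold cross-validation, this split is repeated $k$ times so that every observation is used for testing exactly once, and the test errors are averaged.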
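The split described above can be sketched with the standard library alone. The function names `holdout_split` and `kfold_splits`, the 30% test fraction and the fixed seed are illustrative assumptions, not part of the original notes; the $k$-fold variant simply repeats the split so that each observation lands in a test set once.

```python
import random

def holdout_split(n_samples, test_fraction=0.3, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into (train, test)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

def kfold_splits(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each sample is in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

train_idx, test_idx = holdout_split(10, test_fraction=0.3)
print(len(train_idx), len(test_idx))
```

The model would be fitted on the rows indexed by `train_idx` and scored on those indexed by `test_idx`; with `kfold_splits`, the score is averaged over the folds.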

# Information approaches/penalized log-likelihood

If we knew the right subset of features to study, we would only need to maximize the fit to the observed data, i.e. maximize the log-likelihood. Because we do not, we consider parsimony along with the fit, and maximize the following penalized log-likelihood:

$$\ln L(X,y,\hat{\theta}) - c(\hat{\theta})$$

Where:

• $\hat{\theta}$ is a vector of parameters (for linear models, this is ($\hat{β}, σ^{2}$)).
• $L(X,y,\hat{\theta})$ is the likelihood function of the model.
• $c(\hat{\theta})$ is a measurement of model complexity, usually some norm that measures how big $\hat{\theta}$ is.

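In general, these criteria are written on the $-2\ln L$ scale, so that lower values are better, and take this form:

$$-2\ln L(X,y,\hat{\theta}) + \lambda p_{in}$$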

Where:

• $\lambda$ is a factor that controls the penalty for complexity.
• $p_{in}$ is the number of parameters included in the model.

## Likelihood

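The likelihood of a model $M$ is $L=p(x|\hat{\theta},M)$, the probability of the observed data under the model evaluated at the estimated parameters. For linear models, it assumes the data are generated by the model plus some Gaussian noise.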
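As a minimal sketch of this definition, the code below fits a one-predictor linear model by least squares and evaluates its maximized Gaussian log-likelihood. The helper names `fit_simple_linear` and `gaussian_log_likelihood`, the toy data, and the use of the maximum-likelihood variance $\hat{\sigma}^2 = RSS/n$ are assumptions made for illustration.

```python
import math

def fit_simple_linear(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, rss)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, rss

def gaussian_log_likelihood(rss, n):
    """Maximized log-likelihood under Gaussian noise, with sigma^2 = rss / n."""
    sigma2 = rss / n
    # -n/2 * ln(2*pi*sigma^2) - rss / (2*sigma^2), and rss / sigma^2 = n.
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

x = [0, 1, 2, 3]
y = [0.1, 0.9, 2.2, 2.8]
b0, b1, rss = fit_simple_linear(x, y)
print(b0, b1, gaussian_log_likelihood(rss, len(x)))
```

The log-likelihood depends on the data only through the residual sum of squares, which is why criteria for linear models are often written in terms of $RSS$.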

## Akaike Information Criterion

The AIC of Akaike [1973] is motivated by the Kullback-Leibler discrepancy, which (loosely) measures how far a model is from the truth. We assume that the population or process from which the data were sampled is governed by an unknown, perhaps nonparametric, true likelihood function $f(X,y)$, and we want to approximate the unknown $f$ by a model-specific parametric likelihood $g(X,y|θ)$. The “discrepancy” between $f$ and $g$, or “information lost” by representing $f$ by $g$, is defined as $KL(f, g) = E(\ln f(X,y)) - E(\ln g(X,y|θ))$, where both expectations are taken under the true $f$.
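To make the discrepancy concrete, here is a small sketch in which both the "truth" $f$ and the model $g$ are univariate Gaussians. This is purely an illustrative assumption (in practice $f$ is unknown), and the function names are hypothetical. The closed form is compared with a Monte Carlo estimate of $E(\ln f) - E(\ln g)$, with expectations taken under $f$.

```python
import math
import random

def log_normal_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def kl_gaussians(mu_f, sigma_f, mu_g, sigma_g):
    """Closed-form KL(f, g) for univariate Gaussians f and g."""
    return (math.log(sigma_g / sigma_f)
            + (sigma_f ** 2 + (mu_f - mu_g) ** 2) / (2 * sigma_g ** 2)
            - 0.5)

def kl_monte_carlo(mu_f, sigma_f, mu_g, sigma_g, n_draws=20000, seed=0):
    """Estimate E_f[ln f(x)] - E_f[ln g(x)] by sampling x from f."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = rng.gauss(mu_f, sigma_f)
        total += log_normal_pdf(x, mu_f, sigma_f) - log_normal_pdf(x, mu_g, sigma_g)
    return total / n_draws

# Truth f = N(0, 1); candidate model g = N(1, 1).
print(kl_gaussians(0.0, 1.0, 1.0, 1.0))
print(kl_monte_carlo(0.0, 1.0, 1.0, 1.0))
```

The discrepancy is zero only when $g$ coincides with $f$, and the $E(\ln f)$ term is the same whichever $g$ is considered.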

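$E(\ln f(X,y))$ is unknown and the same for all the models being compared, so we really only care about $E(\ln g(X,y|θ))$. This term must be adjusted for the fact that $θ$ is not known but estimated from $X$ and $y$, which makes the maximized log-likelihood an optimistic estimate of it.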

Akaike showed that a good estimator for this term is $\ln L(X,y,\hat{\theta})-p_{in}$. Multiplying by $-2$ gives the criterion, so the best model is the one that minimizes this function:

$$AIC = -2\ln L(X,y,\hat{\theta}) + 2p_{in}$$

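AIC doesn’t work well when $n$ is small relative to the number of parameters (e.g. $n < 40p$), as is the case in GWAS. Small-sample corrections have been proposed. One such example is $AIC_c$:

$$AIC_c = AIC + \frac{2p_{in}(p_{in}+1)}{n-p_{in}-1}$$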
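A minimal sketch of both criteria, assuming the usual definitions $AIC = -2\ln L + 2p_{in}$ and $AIC_c = AIC + 2p_{in}(p_{in}+1)/(n-p_{in}-1)$. The function names and the log-likelihood values in the example are hypothetical and chosen only to show the arithmetic.

```python
def aic(log_lik, p_in):
    """Akaike Information Criterion; lower is better."""
    return -2.0 * log_lik + 2.0 * p_in

def aicc(log_lik, p_in, n):
    """AIC with the small-sample correction; requires n > p_in + 1."""
    if n - p_in - 1 <= 0:
        raise ValueError("AICc is undefined when n <= p_in + 1")
    return aic(log_lik, p_in) + 2.0 * p_in * (p_in + 1) / (n - p_in - 1)

# Hypothetical fits on n = 20 observations: a larger model gains little in fit.
n = 20
small = {"log_lik": -100.0, "p_in": 3}
large = {"log_lik": -99.0, "p_in": 6}
for name, m in (("small", small), ("large", large)):
    print(name, aic(m["log_lik"], m["p_in"]), aicc(m["log_lik"], m["p_in"], n))
```

The correction term grows quickly as $p_{in}$ approaches $n$ and vanishes as $n$ grows, so $AIC_c$ converges to AIC for large samples.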

## Bayesian Information Criterion

The Bayesian Information Criterion (BIC) assumes that the true model is among the candidate models. It tries to find it by looking for the most probable model given the data:

$$\hat{M} = \arg\max_{M_i} Pr(M_i|x,y)$$

Assuming equal priors (i.e. ignorance), $Pr(M_i|x,y)\propto Pr(y|x,M_i)$, and $Pr(y|x,M_i)$ can be well approximated by $\exp(-\frac{1}{2}BIC)$, where

$$BIC = -2\ln L(X,y,\hat{\theta}) + \ln(n)(p_{in}+2)$$

$(p_{in}+ 2)$ is the number of parameters in the model, including the intercept and variance. Thus if we want to choose the model with the highest posterior probability, we need only choose the one with lowest BIC. Classically, $n$ here is the number of subjects and is thought to represent the amount of information in the sample.
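The link between BIC and posterior probability can be sketched as follows, assuming $BIC = -2\ln L + \ln(n)(p_{in}+2)$ and equal prior probabilities for all candidate models. The function names and the example log-likelihoods are hypothetical. Subtracting the smallest BIC before exponentiating only rescales the weights and avoids numerical underflow.

```python
import math

def bic(log_lik, p_in, n):
    """Bayesian Information Criterion; p_in + 2 counts intercept and variance."""
    return -2.0 * log_lik + math.log(n) * (p_in + 2)

def posterior_model_probabilities(bics):
    """Approximate Pr(M_i | data) under equal priors, proportional to exp(-BIC_i / 2)."""
    best = min(bics)
    weights = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(weights)
    return [w / total for w in weights]

# Three hypothetical candidate models fitted to n = 100 observations.
n = 100
candidates = [(-150.0, 1), (-146.0, 2), (-145.5, 5)]  # (log-likelihood, p_in)
bics = [bic(ll, p, n) for ll, p in candidates]
print(bics)
print(posterior_model_probabilities(bics))
```

The model with the lowest BIC receives the largest approximate posterior probability, which is the selection rule stated above.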

### Comparison to AIC

When BIC and AIC agree, we can be confident about our model. When they do not, it really depends on our data:

• High $n$: BIC is usually preferred. It has a stricter selection criterion and tends to pick smaller models, while AIC tends to overfit. Additionally, BIC is a consistent model selection criterion: if the number of models is finite and the true one is among them, the probability of detecting it approaches 1 as $n$ tends to infinity.
• Small $n$: with the rather modest sample sizes we usually work with, BIC tends to underfit. Also, the consistency property is asymptotic and rests on unrealistic assumptions. In those cases, AIC or $AIC_c$ are recommended.
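The difference in strictness comes down to the penalty each criterion charges per parameter: a constant 2 for AIC and $\ln(n)$ for BIC, assuming the standard definitions of both. The helper below is a hypothetical illustration of that arithmetic.

```python
import math

def penalty_per_parameter(criterion, n):
    """Cost added to the criterion for each extra parameter."""
    if criterion == "AIC":
        return 2.0
    if criterion == "BIC":
        return math.log(n)
    raise ValueError("unknown criterion: " + criterion)

for n in (5, 8, 100, 10000):
    print(n, penalty_per_parameter("AIC", n), round(penalty_per_parameter("BIC", n), 2))
```

BIC charges more than AIC as soon as $\ln(n) > 2$, i.e. from $n = 8$ onwards, and the gap keeps widening with $n$, which is why BIC selects smaller models.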

### Modifications of BIC

These criteria were derived under the assumption that, as the sample size $n$ goes to infinity, the total number of available regressors $p$ remains constant (Żak-Szatkowska & Bogdan, 2011). Therefore, they might not be appropriate if $p$ is comparable to or larger than $n$, as happens in GWAS. For example, in QTL mapping studies it was shown that BIC leads to an overestimation of the number of regressors. Several modifications have been proposed to tackle this problem:

• EBIC and mBIC include prior distributions on the model size: uniform and binomial respectively.

When only a small subset of the $p$ regressors is assumed to be explanatory, sparsity is usually enforced more strongly.

The $\lambda$ factor is a data-adaptive penalty derived using the generalized degrees of freedom for a given modeling procedure.