We split the data into two parts. We train the model on one of them (the training set) and test it on the other (the testing set).
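A minimal sketch of such a split, using only the Python standard library. The function name `train_test_split`, the 30% test fraction and the seed are illustrative assumptions, not something prescribed by these notes.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of `rows` and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                      # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data: ten observation indices.
data = list(range(10))
train, test = train_test_split(data, test_fraction=0.3, seed=42)
print(len(train), len(test))
```

Shuffling before splitting guards against any ordering in the data (for example, samples sorted by phenotype) leaking into one of the two parts.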
If we knew the right subset of features to study, we would just need to maximize the fit to the observed data, i.e. maximize the log-likelihood. Because we do not, we can consider parsimony along with the fit, and maximize the following:
Where:
In general, these measurements take this form:
Where:
The likelihood of a model $M$ is $L=p(x|\hat{\theta},M)$. It assumes the data is generated by the model plus some Gaussian noise.
The AIC of Akaike [1973] is motivated by the Kullback-Leibler discrepancy, which (loosely) measures how far a model is from the truth. We assume that the population or process from which the data was sampled is governed by an unknown, perhaps nonparametric, true likelihood function $f(X,y)$, and we want to approximate the unknown $f$ by a model-specific parametric likelihood $g(X,y|\theta)$. The "discrepancy" between $f$ and $g$, or "information lost" by representing $f$ by $g$, is defined as $KL(f, g) = E(\ln f(X,y)) - E(\ln g(X,y|\theta))$.
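To make the definition concrete, here is a small sketch that computes the discrepancy for discrete distributions, where the expectations become sums weighted by the true probabilities. The three distributions (`truth`, `model_a`, `model_b`) are made-up numbers for illustration only; in practice $f$ is unknown.

```python
import math


def expected_log(f, g):
    """E_f[ln g]: expectation, under the true distribution f, of ln g."""
    return sum(fi * math.log(gi) for fi, gi in zip(f, g) if fi > 0)


def kl(f, g):
    """KL(f, g) = E_f[ln f] - E_f[ln g]."""
    return expected_log(f, f) - expected_log(f, g)


# Hypothetical "truth" and two candidate models over three outcomes.
truth = [0.5, 0.3, 0.2]
model_a = [0.4, 0.4, 0.2]
model_b = [0.8, 0.1, 0.1]

for name, g in [("model_a", model_a), ("model_b", model_b)]:
    print(name, "KL =", round(kl(truth, g), 4),
          "E[ln g] =", round(expected_log(truth, g), 4))
```

Because the term $E(\ln f)$ is shared by both candidates, the model with the smaller discrepancy is also the one with the larger $E(\ln g)$.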
$E(\ln f(X,y))$ is unknown and the same for all the models being compared, so we really only care about $E(\ln g(X,y|\theta))$. It must be adjusted for the fact that $\theta$ is estimated from $X$ and $y$:
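Akaike showed that a good estimator for this term is $L(X,y,\hat{\theta})-p_{in}$, where $L(X,y,\hat{\theta})$ is the maximized log-likelihood. Multiplying by $-2$, as is conventional, the best model should minimize this function:

$$AIC = -2\left(L(X,y,\hat{\theta}) - p_{in}\right) = -2L(X,y,\hat{\theta}) + 2p_{in}$$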
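AIC doesn't work well when $n$ is small relative to the number of parameters (e.g. $n < 40p$), such as in GWAS. Small-sample corrections have been proposed. One such example is $AIC_c$:

$$AIC_c = AIC + \frac{2p(p+1)}{n-p-1}$$

where $p$ is the number of estimated parameters. The correction term vanishes as $n$ grows, so $AIC_c$ converges to AIC for large samples.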
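The two criteria can be sketched in a few lines. This assumes the commonly used forms $AIC = -2\ln L + 2p$ and $AIC_c = AIC + 2p(p+1)/(n-p-1)$, with $p$ the number of estimated parameters; the log-likelihood value, `p` and `n` below are made-up inputs for illustration.

```python
def aic(log_lik, p):
    """AIC = -2 * log-likelihood + 2 * (number of parameters)."""
    return -2.0 * log_lik + 2.0 * p


def aicc(log_lik, p, n):
    """Small-sample corrected AIC; requires n > p + 1."""
    if n - p - 1 <= 0:
        raise ValueError("AICc is undefined unless n > p + 1")
    return aic(log_lik, p) + 2.0 * p * (p + 1) / (n - p - 1)


# Hypothetical fit: maximized log-likelihood -120 with 5 parameters.
print(aic(-120.0, 5))            # 250.0
print(aicc(-120.0, 5, n=30))     # 252.5  (small sample: visible correction)
print(aicc(-120.0, 5, n=3000))   # close to the plain AIC
```

With $n = 30$ the correction adds 2.5 to the criterion, while with $n = 3000$ it is negligible, which matches the idea that the correction only matters when the sample is small relative to the number of parameters.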
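The Bayesian Information Criterion (BIC) assumes that the true model is present among the candidate models. It tries to find it by looking for the most probable model given the data, i.e. the model $M_i$ that maximizes the posterior probability $Pr(M_i|x,y) \propto Pr(y|x,M_i)\,Pr(M_i)$.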
Assuming equal priors (i.e. ignorance), $Pr(M_i|x,y)\propto Pr(y|x,M_i)$, and $Pr(y|x,M_i)$ can be well approximated by $\exp(-\frac{1}{2}BIC)$, where

$$BIC = -2L(X,y,\hat{\theta}) + (p_{in}+2)\ln n$$
$(p_{in}+ 2)$ is the number of parameters in the model, including the intercept and variance. Thus if we want to choose the model with the highest posterior probability, we need only choose the one with lowest BIC. Classically, $n$ here is the number of subjects and is thought to represent the amount of information in the sample.
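As a sketch of how this is used, the snippet below computes BIC in the common form $-2\ln L + (p_{in}+2)\ln n$ and turns a set of BIC values into approximate posterior model probabilities by normalizing $\exp(-\frac{1}{2}BIC)$ under equal priors. The function names and the two candidate fits are hypothetical.

```python
import math


def bic(log_lik, p_in, n):
    """BIC with (p_in + 2) parameters: predictors, intercept and variance."""
    return -2.0 * log_lik + (p_in + 2) * math.log(n)


def posterior_weights(bics):
    """Approximate Pr(M_i | data) under equal priors from a list of BICs."""
    best = min(bics)
    # Subtracting the minimum avoids underflow in exp() for large BIC values.
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]


# Two hypothetical fits on n = 100 subjects: (log-likelihood, p_in).
candidates = [(-120.0, 2), (-118.5, 4)]
bics = [bic(ll, p, n=100) for ll, p in candidates]
weights = posterior_weights(bics)
print([round(b, 2) for b in bics])
print([round(w, 3) for w in weights])
```

In this made-up example the second model fits slightly better, but its two extra parameters cost more than the gain in log-likelihood, so the first model has the lower BIC and the higher approximate posterior probability.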
When BIC and AIC agree, we can be confident about our model. When they do not, it really depends on our data:
The $\lambda$ factor is a data-adaptive penalty derived using the generalized degrees of freedom for a given modeling procedure.
Google: Bayes factor, model evidence.