Feature selection is the process of selecting a subset of relevant features for use in model construction (Guyon and Elisseeff, 2003). It does not necessarily mean keeping every relevant variable: when several of them are correlated, for example, some may be redundant. It has three goals: gaining a better understanding of the process that generated the data, improving the performance of the predictors, and creating faster and more cost-effective predictors.

We will use the following notation for a supervised learning setup: a set of m examples {x_{k}, y_{k}} (k = 1, ..., m), each consisting of n input variables x_{k,i} (i = 1, ..., n) and one output variable y_{k}. If the input vector x is interpreted as a realization drawn from an underlying, unknown distribution, X_{i} is the random variable corresponding to the i-th component of x. Likewise, y is a realization of the random variable Y.

In the following sections, we will present methods for feature selection in order of increasing complexity.

In variable ranking, a scoring function S(i), computed from the values x_{k,i} and y_{k} (k = 1, ..., m), evaluates the relevance of feature i. This relevance is then used to decide whether the feature makes it into the predictor. Variable ranking is usually a preprocessing step, independent of the choice of predictor, and is followed by a filtering step. The scoring function S(i) can be based on:

- Correlation criteria: they detect dependencies between X_{i} and Y, linear or not. Some examples are Pearson correlation, the T-test criterion and Fisher's criterion.
- Single variable classifiers: we select variables based on their individual predictive power, according to the performance (e.g. AUC) of a classifier built with that variable only.
- Information criteria: we study the mutual information between each variable and the target. It is especially useful for qualitative variables, as quantitative ones require knowledge of the densities p(x_{i}), p(y) and p(x_{i}, y).
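
To make the ranking-then-filtering idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a reference implementation: the function names (`pearson_score`, `mutual_information_score`, `rank_features`) and the toy data are made up for this example. The first scorer uses the absolute Pearson correlation as S(i); the second estimates mutual information from frequency counts, which assumes both the variable and the target are qualitative (discrete).

```python
import math
from collections import Counter


def pearson_score(column, target):
    """S(i) as the absolute Pearson correlation between one feature and the target."""
    m = len(column)
    mean_x = sum(column) / m
    mean_y = sum(target) / m
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(column, target))
    var_x = sum((x - mean_x) ** 2 for x in column)
    var_y = sum((y - mean_y) ** 2 for y in target)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here: a constant column gets score 0
    return abs(cov) / math.sqrt(var_x * var_y)


def mutual_information_score(column, target):
    """S(i) as mutual information (in nats), with probabilities estimated
    by frequency counts. Only meaningful for discrete variables."""
    m = len(column)
    count_x = Counter(column)
    count_y = Counter(target)
    count_xy = Counter(zip(column, target))
    return sum(
        (c / m) * math.log((c / m) / ((count_x[x] / m) * (count_y[t] / m)))
        for (x, t), c in count_xy.items()
    )


def rank_features(X, y, score=pearson_score):
    """Score every feature independently and return (indices sorted from
    most to least relevant, list of scores)."""
    n = len(X[0])
    scores = [score([row[i] for row in X], y) for i in range(n)]
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy data, invented for illustration: feature 0 follows y closely,
# feature 1 is only weakly related to y, feature 2 is constant.
X = [[1.0, 5.0, 3.0],
     [2.0, 1.0, 3.0],
     [3.0, 4.0, 3.0],
     [4.0, 2.0, 3.0],
     [5.0, 3.0, 3.0]]
y = [1.1, 1.9, 3.2, 3.9, 5.1]

order, scores = rank_features(X, y)
selected = order[:2]  # filtering step: keep the two best-ranked features

# Discrete example for the information criterion.
mi_dependent = mutual_information_score(["a", "a", "b", "b"], [0, 0, 1, 1])
mi_independent = mutual_information_score(["a", "b", "a", "b"], [0, 0, 1, 1])
```

Each feature is scored on its own, so the ranking does not depend on the predictor that is trained afterwards. The number of features kept (two here) is an arbitrary choice for the example; in practice it could be a threshold on the score or a value chosen by validation.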