Differences between Training, Validation, and Test Set in Machine Learning

When tackling a supervised machine learning task, the developers of the machine learning solution often divide the labelled examples available to them into three partitions: a training set, a validation set, and a test set. To understand their differences, it is useful to examine how the need for such a division can arise during the development process of a machine learning solution.
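
As a rough illustration of such a three-way split, the sketch below divides a labelled dataset into 60% training, 20% validation, and 20% test partitions using scikit-learn’s train_test_split; the proportions, the synthetic data, and the choice of library are only assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assume X (inputs) and y (labels) hold all of the labelled examples available to us;
# here they are filled with synthetic values purely for illustration.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# First withhold the test set (20% of the data); it stays untouched until the very end.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Then split the remaining development data into training and validation sets
# (75% / 25% of the remainder, i.e. 60% / 20% of the original data).
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)
```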

When developing solutions to a supervised machine learning task, our ultimate goal is usually to find a classification/regression model that is able to produce accurate predictions for unseen inputs.

To achieve this, we start by supplying a learning algorithm with the training set; the learning algorithm then sets out to fine-tune the parameters of our model such that the model makes as few mistakes as possible when asked to reproduce the correct outputs from the inputs contained in the training set.
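
Continuing the split sketched above, this training step might look like the following; the choice of a logistic regression model is only an assumption made for illustration.

```python
from sklearn.linear_model import LogisticRegression

# The learning algorithm adjusts the model's trainable parameters (here, the weights of
# a logistic regression) so that the model's predictions on the training inputs match
# the training labels as closely as possible.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```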

At this point, our two main assumptions are that:

  1. the model learnt by the algorithm is able to capture the underlying relationships between the inputs and the outputs as they appear in the training set, and
  2. such relationships hold in general, and hence can be carried over to infer the correct outputs for input data not seen in the training set.

The validity of the first assumption can be verified by observing the in-sample error of the model during the training phase -- that is, the number of errors made by the model when it is applied to the same data it was trained on. To verify the second assumption, we need to evaluate the performance of the trained model on a separate validation set -- a collection of labelled examples that were not shown to the model during its training; the validation set effectively serves as the “unseen inputs” for the trained model.
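
Continuing the sketch above, both quantities can be estimated by scoring the trained model on the two partitions; using classification accuracy as the underlying metric is an assumption made for this example.

```python
# In-sample error: how often the model is wrong on the same data it was trained on.
in_sample_error = 1.0 - model.score(X_train, y_train)

# Estimated out-of-sample error: how often the model is wrong on the held-out
# validation set, which stands in for the "unseen inputs".
validation_error = 1.0 - model.score(X_val, y_val)

print(f"in-sample error:  {in_sample_error:.3f}")
print(f"validation error: {validation_error:.3f}")
```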

In practice, the validation set is used to select the model that exhibits the best ability to generalize over unseen data; more specifically, it is used to identify the model that shows the lowest generalization error (also called the out-of-sample error) among many candidate models. Some typical use-cases include:

  1. Hyper-parameter tuning: For many machine learning models, there are certain aspects of the model that are not easily “trainable” via the learning algorithm, e.g. the structure of a neural network, the amount of regularization applied to the weights, the learning rate of the gradient descent procedure, etc. Such parameters are called “hyper-parameters”. An exhaustive “grid search” is often used to find the hyper-parameter combination that yields the model with the best predictive performance over the validation set after training (a minimal sketch follows after this list).
  2. Model selection: It is common for machine learning practitioners to train several different types/families of models using the same training set and pick the most performant model type after comparing their predictive performance over the validation set.
  3. Detecting model overfit: Complex models have a strong tendency to overfit when trained on a limited amount of data. A model is said to be overfit when it starts to pick up noise and rely on trivial patterns in the training data to make predictions; such models tend to generalize poorly on unseen inputs. Overfitted models can be identified by their high in-sample performance during training but poor out-of-sample performance when evaluated on the validation set.
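
As a minimal sketch of the first and third use-cases, and still assuming the logistic-regression example from above, a grid search over a single hyper-parameter (the inverse regularization strength C) might look like the following; a real grid search would enumerate combinations of several hyper-parameters.

```python
from sklearn.linear_model import LogisticRegression

candidate_C = [0.001, 0.01, 0.1, 1.0, 10.0]
best_C, best_val_score = None, float("-inf")

for C in candidate_C:
    candidate = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    train_score = candidate.score(X_train, y_train)
    val_score = candidate.score(X_val, y_val)

    # A large gap between train_score and val_score is a symptom of overfitting.
    print(f"C={C}: train accuracy {train_score:.3f}, validation accuracy {val_score:.3f}")

    # Keep the hyper-parameter value whose model generalizes best on the validation set.
    if val_score > best_val_score:
        best_C, best_val_score = C, val_score

print(f"selected C={best_C} (validation accuracy {best_val_score:.3f})")
```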

The training set and the validation set together are sometimes referred to as the development set, as they are the data used to develop the machine learning solution.

Once the most performant model is finalized based on the development set, any claims about the performance of the machine learning solution must be based on an evaluation over the test set -- another separate set of labelled examples withheld from the entire model development process. In the most rigorous experiments, the test set is kept away from the developers altogether, to prevent them from compromising the results by adapting their models to feedback derived from evaluations on the test set.
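
To round off the running sketch, the final evaluation might look like the following; refitting the chosen model on the full development set before the single test-set evaluation is a common convention assumed here, not something prescribed by the text above.

```python
from sklearn.linear_model import LogisticRegression

# Refit the selected model on the whole development set (training + validation),
# then touch the test set exactly once; the resulting figure is what gets reported.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_dev, y_dev)
test_error = 1.0 - final_model.score(X_test, y_test)
print(f"reported test error: {test_error:.3f}")
```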