Early stopping

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a model with an iterative method, such as gradient descent. Such methods update the model to make it better fit the training data with each iteration. Up to a point, this improves the model's performance on data outside of the training set (e.g., the validation set). Past that point, however, improving the model's fit to the training data comes at the expense of increased generalization error. Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit. Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation.

Background

This section presents some of the basic machine-learning concepts required for a description of early stopping methods.

Overfitting

thumb|Figure 1. The green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data and it is likely to have a higher error rate on new unseen data illustrated by black-outlined dots, compared to the black line.

Machine learning algorithms train a model based on a finite set of training data. During this training, the model is evaluated based on how well it predicts the observations contained in the training set. In general, however, the goal of a machine learning scheme is to produce a model that generalizes, that is, that predicts previously unseen observations. Overfitting occurs when a model fits the data in the training set well, while incurring larger generalization error.

Regularization

Regularization, in the context of machine learning, refers to the process of modifying a learning algorithm so as to prevent overfitting. This generally involves imposing some sort of smoothness constraint on the learned model.

This smoothness may be enforced explicitly, by fixing the number of parameters in the model, or by augmenting the cost function as in Tikhonov regularization. Tikhonov regularization, along with principal component regression and many other regularization schemes, fall under the umbrella of spectral regularization, regularization characterized by the application of a filter. Early stopping also belongs to this class of methods.

Gradient descent methods

Gradient descent methods are first-order, iterative, optimization methods. Each iteration updates an approximate solution to the optimization problem by taking a step in the direction of the negative of the gradient of the objective function. By choosing the step-size appropriately, such a method can be made to converge to a local minimum of the objective function. Gradient descent is used in machine-learning by defining a loss function that reflects the error of the learner on the training set and then minimizing that function.

Early stopping based on analytical results

Early stopping in statistical learning theory

Early-stopping can be used to regularize non-parametric regression problems encountered in statistical learning theory. For a given input space, <math>X</math>, output space, <math>Y</math>, and samples drawn from an unknown probability measure, <math>\rho</math>, on <math>Z = X \times Y</math>, the goal of such problems is to approximate a regression function, <math>f_{\rho}</math>, given by

:<math> f_\rho(x) = \int_Y y \, d\rho(y\mid x),\, x \in X,</math>

where <math>\rho(y\mid x)</math> is the conditional distribution at <math>x</math> induced by <math>\rho</math>.

One common choice for approximating the regression function is to use functions from a reproducing kernel Hilbert space.

Example: Least-squares loss

(Adapted from Yao, Rosasco and Caponnetto, 2007

L-boosting

Boosting methods have close ties to the gradient descent methods described above can be regarded as a boosting method based on the <math>L_{2}</math> loss: LBoost.

Cross-validation is an alternative that is applicable to non time-series scenarios. Cross-validation involves splitting multiple partitions of the data into training set and validation set – instead of a single partition into a training set and validation set. Even this simple procedure is complicated in practice by the fact that the validation error may fluctuate during training, producing multiple local minima. This complication has led to the creation of many ad hoc rules for deciding when overfitting has truly begun.