Ensembles in Auger

Among the many advanced automated machine learning techniques that Auger offers to improve predictive results is ensemble generation. Ensemble generation is a powerful technique that aims to increase predictive performance on a given machine learning task by combining several predictive models. It often reduces the model’s generalization error and improves its robustness to data shifts, both of which benefit later inference.

Most popular ML libraries and platforms don’t provide ensembles out of the box. Instead, they assume that the data scientist or user will prepare them. Assembling optimal ensembles is typically beyond the expertise of most developers or business analysts.

Auger provides the latest state-of-the-art ensemble techniques, such as Advanced Ensemble Selection, Super Learner and Deep Super Learner. It also supports the classic methods of Voting and Averaging, which let you define “weights” for each base model and so control your own ensemble generation procedure.

Let’s dive a bit deeper. Auger provides a leaderboard of evaluated models together with their parameters, which is used later in the ensemble construction stage. This leaderboard can consist of thousands of models or more, ordered by score. Auger therefore lets you restrict the number of models considered for ensembles, which decreases construction time. Advanced ensemble methods rely on cross-validation predictions from each model.

Advanced Ensemble Selection

Auger’s Advanced Ensemble Selection is an improved version of the previously proposed Ensemble Selection algorithm. This method also guarantees that the ensemble performs at least as well as the best model within it, so using the ensemble can only maintain or increase performance relative to that model.

All models, regardless of their performance, are added to the model library for the problem. The expectation is that some of the models will perform well on the problem, either in isolation or in combination with other models, for any reasonable performance metric. The “selection” procedure then adds models step by step to an existing ensemble, as follows:

auger-ensemble.png

The algorithm uses smart initialization, which can be controlled through parameters (but by default starts with an empty set of parameters). Bagging is used to reduce the variance over selected models. Auger also supports sets of ensembles, in which the best models can be added to an ensemble more than once. We use this in the inference stage for weighted prediction.
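The step-by-step selection described above can be sketched as a greedy loop that, at every step, adds whichever model most improves the ensemble's cross-validated score, allowing the same model to be picked more than once. This is only a minimal illustration of the general Ensemble Selection idea, not Auger's implementation: the function names, the mean squared error metric, and the toy predictions are all assumptions, and the smart initialization and bagging mentioned above are left out.

```python
from collections import Counter


def mse(pred, target):
    """Mean squared error; lower is better (the metric is an assumption)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def ensemble_selection(cv_preds, target, n_steps=10):
    """Greedy forward selection with replacement (hypothetical helper).

    cv_preds maps a model name to its cross-validation predictions.
    Returns a Counter of how many times each model was selected.
    """
    running_sum = [0.0] * len(target)
    picks = Counter()
    for step in range(1, n_steps + 1):
        best_name, best_loss = None, float("inf")
        for name, preds in cv_preds.items():
            # Score the ensemble average as if this model were added.
            candidate = [(s + p) / step for s, p in zip(running_sum, preds)]
            loss = mse(candidate, target)
            if loss < best_loss:
                best_name, best_loss = name, loss
        picks[best_name] += 1
        running_sum = [s + p for s, p in zip(running_sum, cv_preds[best_name])]
    return picks


def selection_weights(picks):
    """Turn selection counts into weights for weighted prediction."""
    total = sum(picks.values())
    return {name: count / total for name, count in picks.items()}


# Toy data: model_a overshoots, model_b undershoots, model_c is poor.
target = [1.0, 2.0, 3.0, 4.0]
cv_preds = {
    "model_a": [1.5, 2.5, 3.5, 4.5],
    "model_b": [0.5, 1.5, 2.5, 3.5],
    "model_c": [4.0, 3.0, 2.0, 1.0],
}
picks = ensemble_selection(cv_preds, target, n_steps=4)
weights = selection_weights(picks)
```

Because a model can be selected repeatedly, the selection counts double as weights at inference time: a model picked twice out of four steps contributes half of the weighted prediction.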

Super Learner

This ensemble technique relies heavily on cross-validation predictions (using “soft” labels for classification) to form what is called the “level-one” data, on which the meta-learner is trained. This data is a prediction matrix whose rows represent samples and whose columns represent the selected models. For a classification task it becomes a cube, with one more dimension equal to the number of classes. Meta-learning is performed using the L-BFGS, SLSQP and NNLS (regression) algorithms.

The ensemble construction algorithm is:

1. Define inputs:

a. Specify a set of N base models (from the leaderboard).

b. Specify a meta-learning algorithm (L-BFGS, SLSQP, NNLS).

The algorithm also has access to the source data and cross-validation predictions for selected models.

2. Construction (selection) stage:

k-fold cross-validation produces, for each of the N selected models, a predicted value for each sample in the provided data (S samples, for example).

We represent these as an S x N matrix or, for classification, an S x N x C cube, where C is the number of classes.

a. Train the meta-learning algorithm on this matrix (cube), subject to the optimization constraints, to get a weight for each provided model.

b. Eliminate models whose weights fall below a specified threshold.

3. Prediction:

Perform a weighted prediction using the weights learned for the remaining models.
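For a regression task, the three stages above can be sketched with NumPy and SciPy's `scipy.optimize.nnls`. This is a minimal illustration under stated assumptions rather than Auger's code: the helper names, the 0.05 threshold, the renormalization of the weights so they sum to one, and the synthetic level-one matrix are all hypothetical choices made for the example.

```python
import numpy as np
from scipy.optimize import nnls


def fit_super_learner_weights(level_one, target, threshold=0.05):
    """Learn one non-negative weight per model from level-one data.

    level_one: S x N matrix of cross-validation predictions.
    target:    length-S vector of true values.
    threshold: weights below this are dropped (assumed value).
    """
    raw, _residual_norm = nnls(level_one, target)
    if raw.sum() == 0:
        raise ValueError("NNLS assigned zero weight to every model")
    weights = raw / raw.sum()
    # Stage 2b: eliminate models whose weight is under the threshold.
    weights = np.where(weights >= threshold, weights, 0.0)
    if weights.sum() == 0:
        raise ValueError("threshold removed every model")
    return weights / weights.sum()


def weighted_prediction(weights, model_preds):
    """Stage 3: combine per-model predictions (S x N) with the weights."""
    return np.asarray(model_preds) @ weights


# Synthetic level-one data: by construction the target is a 0.7 / 0.3
# blend of the first two columns; the third column is unrelated noise.
rng = np.random.default_rng(0)
level_one = rng.normal(size=(50, 3))
target = 0.7 * level_one[:, 0] + 0.3 * level_one[:, 1]

weights = fit_super_learner_weights(level_one, target)
blended = weighted_prediction(weights, level_one)
```

For classification, the same idea applies to the S x N x C cube, typically with a constrained optimizer such as SLSQP in place of NNLS; that variant is not shown here.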

Deep Super Learner

This is similar to Super Learner, but the construction stage is divided into several iterations. After the first construction iteration finishes, the algorithm produces additional features, one per class. These are added to the existing features, and the next optimization iteration is performed. Iteration stops once the loss function passes a defined threshold.

The same procedure is used for prediction. The principal scheme works as follows:

principal-scheme.png
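The iterative structure can be sketched as a loop that appends the ensemble's class probabilities to the features and repeats. This is a hedged outline only: `run_iteration` is a hypothetical callback standing in for one full Super Learner construction pass, the class probabilities are appended to the original features on each pass (one possible reading of "added to the existing features"), and the loop stops when the loss drops to `loss_threshold` or below, which is one interpretation of the stopping rule.

```python
import numpy as np


def deep_super_learner(features, run_iteration, loss_threshold, max_iterations=10):
    """Iterate construction passes, feeding class probabilities back in.

    features:      S x F matrix of original features.
    run_iteration: hypothetical callback; takes the current feature matrix
                   and returns (S x C class probabilities, loss).
    Returns the last class probabilities and the loss per iteration.
    """
    features = np.asarray(features, dtype=float)
    current = features
    losses = []
    for _ in range(max_iterations):
        class_probs, loss = run_iteration(current)
        losses.append(loss)
        if loss <= loss_threshold:
            break
        # Add one new feature per class for the next iteration.
        current = np.hstack([features, class_probs])
    return class_probs, losses


def make_toy_iteration(n_classes=2):
    """Toy stand-in for a construction pass: the loss shrinks on each call."""
    calls = {"count": 0}

    def run_iteration(current_features):
        calls["count"] += 1
        n_samples = current_features.shape[0]
        probs = np.full((n_samples, n_classes), 1.0 / n_classes)
        return probs, 1.0 / calls["count"]

    return run_iteration


probs, losses = deep_super_learner(
    np.zeros((4, 3)), make_toy_iteration(), loss_threshold=0.3
)
```

At prediction time the same chain of iterations would be replayed on new samples, so each iteration's fitted models and weights need to be kept.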

Weighted Ensemble, Voting, Averaging

All of these methods are classic and can be used with predefined weights. Weighted Ensemble is the base ensemble method for Super Learner and Advanced Ensemble Selection, since those methods simply produce a set of models with a weight for each one. This makes it possible to use a small ensemble for prediction after the construction procedure has been run over a much larger one.

Voting and Averaging can also use weights, but they apply them in a more straightforward way within Auger. Voting is often used for classification tasks, because it can use “soft” labels. Voting is well known to work better for weakly correlated models, and averaging often reduces overfitting. Voting and Averaging typically use an intelligent selector to choose the most suitable models for each of them.
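As a small illustration of how predefined weights enter these classic methods, the sketch below shows weighted averaging for regression and weighted “soft” voting for classification using `numpy.average`. The function names and the toy numbers are assumptions for the example and do not reflect Auger's internal API.

```python
import numpy as np


def weighted_average(predictions, weights):
    """Averaging: predictions is N models x S samples, weights has length N."""
    return np.average(predictions, axis=0, weights=weights)


def soft_vote(probabilities, weights):
    """Soft voting: probabilities is N models x S samples x C classes.

    Blends the class probabilities with the weights, then picks the
    most probable class for each sample.
    """
    blended = np.average(probabilities, axis=0, weights=weights)
    return blended.argmax(axis=1)


# Two models, two samples, two classes (toy numbers).
probabilities = np.array([
    [[0.9, 0.1], [0.4, 0.6]],  # model 0
    [[0.2, 0.8], [0.3, 0.7]],  # model 1
])

favor_model_0 = soft_vote(probabilities, weights=[3, 1])
favor_model_1 = soft_vote(probabilities, weights=[1, 3])
averaged = weighted_average(np.array([[1.0, 2.0], [3.0, 4.0]]), weights=[1, 1])
```

Changing the weights changes the outcome for the first sample: it is assigned class 0 when model 0 dominates and class 1 when model 1 dominates, while the second sample is class 1 either way.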