Super learning with MLens + Custom Estimators

Shah Mahdi Hasan
11 min read · Feb 18, 2022



The goal of this article is to provide a very high-level intuition on super learning. For the explanation, I will rely on first principles as much as possible while intentionally avoiding the proliferation of mathematical symbols. Upon building the foundation, we will examine how we can include super learning in our arsenal by constructing a multi-layer super learner using MLens. MLens natively supports the built-in estimators that come with popular machine learning packages (for example, Scikit-Learn). We will then explore how we can extend the use case of MLens by including an estimator developed by you in the super learner.

Note the use of the term “estimators” instead of the more common “models.” This is to align with the origins of super learning in biostatistics, where the term “model” is reserved for probability distributions.

Before diving deep into the article, how about deconstructing the seemingly obscure title?

Let’s first take a look at the idea of meta-learning. Meta-learning is not about learning things in the so-called metaverse, but that doesn’t make it uncool. ‘Meta,’ as linguists remark, means a higher level of abstraction. In the context of machine learning (ML), a higher-level abstraction for a given dataset can be an empirical probability distribution. Alternatively, it can be an estimator that models the dataset, providing us with a predictive advantage. For the same set of data, we can create an array of estimators, each providing a different type of abstraction. Now, the question is, can we introduce another layer of abstraction over those estimators? The answer is yes. This is known as meta-learning, or learning from the learners. Super learning is a form of meta-learning.

Learning from the learners is not something new to ML practitioners. Ensemble-based techniques that combine an array of learners (for example, Random Forest and XGBoost) are now staples, forming the first line of defense in a wide range of predictive challenges. While Random Forest and XGBoost employ completely different types of ensembling strategies (bagging and boosting, respectively), they share a common trait: the homogeneity of their base learners. They are all, obviously, decision trees. In this article, we will instead focus on cases where the base learners do not necessarily share any traits. For ease of navigation, here is a small table of contents for you.

Table of Contents

  • To the uncharted territory
  • Enter Jensen’s Inequality
  • Where super learning fits in and why it is not a silver bullet
  • Enter ML-Ensemble
  • Introducing custom estimators in the super learner
  • Conclusion

To the uncharted territory

Now, let’s take a step into relatively uncharted territory: ensembling with a set of heterogeneous base learners. For example, some of these learners can be Kernel Support Vector Machines (SVM) with different kernels, while others can be Generalized Linear Models (GLM). An obvious first step that anyone would take is to simply average (possibly weighted) the outcomes of these base learners. In fact, this seemingly straightforward approach can yield a significant performance improvement, as demonstrated in several competitions hosted by Kaggle. But why does this simple strategy work? Consider the following figure.

Figure 1: One estimator is in the underestimation zone while the other is overestimating. The blue dot is the squared error produced by the convex combination ensembling strategy.

In the figure above, the horizontal axis is the error produced by the estimators on a particular split during the cross-validation stage, and the vertical axis is the squared validation error. Here, y is the true label and y_hat is the estimate produced by an estimator. Let f1 and f2 be the two estimators serving as candidate base learners for this demo. The green dots on the error curve are the squared errors produced by these estimators. In this instance, f1 underestimates the true label while f2 overestimates it. Clearly, if we had to select one estimator based on this split, we would pick f2, since it yields the lower squared error. Using some fundamental probability theory, one can show that the prediction of the linear combination a*f1 + b*f2 always lies between f1 and f2, so its squared error sits on the error curve somewhere between the two green dots, as long as the combination is convex. A convex combination is just a fancy way of saying a + b = 1 with a, b ≥ 0. In this case, we can clearly see that there exist convex combinations for which the resulting ensemble is even better than both f1 and f2 in terms of squared error. Are we better off using those convex combinations and avoiding the hassle of estimator selection altogether?
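To make the Figure 1 scenario concrete, here is a tiny sketch with made-up numbers (not taken from any real dataset): one estimator undershoots, the other overshoots, and we sweep the convex weight a.

```python
import numpy as np

y = 10.0             # true label
f1, f2 = 8.0, 11.5   # f1 underestimates, f2 overestimates (the Figure 1 scenario)

for a in np.linspace(0, 1, 11):      # weight on f1; weight on f2 is b = 1 - a
    blend = a * f1 + (1 - a) * f2
    print(f"a={a:.1f}  squared error={(y - blend) ** 2:.2f}")

# Individual squared errors are (10 - 8)^2 = 4.00 and (10 - 11.5)^2 = 2.25,
# yet several convex combinations (e.g. a = 0.4) land below both of them.
```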

Now let us move to the next split. Due to the stochastic nature of most learning algorithms, and a possibly very different joint distribution exhibited by the data points in the current split, we might encounter an entirely different picture, as shown in the figure below:

Figure 2: Both estimators are overestimating the true label. The blue dot is the squared error produced by the convex combination strategy. The red dot is the average squared error produced by these two estimators.

Here, both f1 and f2 overestimate the true label. Choosing f2 based on the previous fold would be a disaster here, since the squared error produced by f1 looks much better in this instance. Also, unlike the previous split, there exists no convex combination that produces an error lower than that of the best-performing estimator (in this case, f1).

Enter Jensen’s Inequality

We cannot help but notice that the error curve is blissfully convex. For such convex functions, we can apply Jensen’s inequality [1]. Jensen’s inequality says that, no matter which region of this curve we operate in, the red dot 🔴 will always sit at or above the blue dot 🔵. In other words, the convex combination (read: weighted average) of the squared errors produced by f1 and f2 will always be at least as large as the error produced by the convex combination of the estimators, i.e., a*f1 + b*f2.
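In symbols: a·(y − f1)² + b·(y − f2)² ≥ (y − (a·f1 + b·f2))² whenever a + b = 1 and a, b ≥ 0. A quick check on the Figure 2 scenario, again with made-up numbers:

```python
y, f1, f2 = 10.0, 11.0, 13.0   # both estimators overestimate (the Figure 2 scenario)
a, b = 0.5, 0.5                # any convex weights (a + b = 1, a, b >= 0) work

avg_of_errors = a * (y - f1) ** 2 + b * (y - f2) ** 2   # red dot: 5.0
error_of_avg = (y - (a * f1 + b * f2)) ** 2             # blue dot: 4.0
assert error_of_avg <= avg_of_errors                    # Jensen's inequality

# Note that no convex combination beats f1 here (its error is 1.0), matching Figure 2.
```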

Based on the findings so far, let us pause and ponder for a moment. We now have two strategies for ensembling with heterogeneous base learners. On one end, we can select a single estimator based on cross-validation. On the other, we can use both estimators via a convex combination. The benefit of the latter is much like insurance against picking the worse estimator (see Figure 2). And whenever it gets the chance, it yields stronger predictive power than both candidate base learners (see Figure 1). From that perspective, it is an insurance policy that does not wait for a disaster to come in handy. How frequently it gets that chance is an intriguing discussion; however, let us keep things simple for now.

One can argue that we are reaping the benefit of convexity only because of our choice of loss function (squared error in this example). What if the business need calls for a loss function other than squared error? In reality, most of the practical loss functions you will encounter in business (for example, log loss, absolute error, entropy) are convex with varying degrees of smoothness.

Where super learning fits in and why it is not a silver bullet

Now that we have the basics of ensembling right, we can move our focus to the core topic of this article. Super learning was first introduced under the name Stacked Generalization in the neural network community back in 1992 [2]; its asymptotic optimality was proven in 2007 [3], which is where it got its current name. As a form of meta-learning, super learning seeks to ensemble a diverse set of learners. In doing so, it inherits the unique aspects of the data learned by each base learner and aims to maximize generalization accuracy.

Let (X, y) denote the training set. Without getting into too much detail, the super learning procedure boils down to the following components:

  • Base learner layer: The base learners are fit on the training data and predictions are made on the k-fold splits. The predictions are compiled to be consumed by the next layer, which can be either a meta learner layer or an intermediate layer. In general, the training set (X, y) is mapped to a prediction set (Z, y), where the data points in Z are constructed from out-of-fold predictions across the k-fold splits.
  • Meta learner layer: From the mapped set (Z, y), the meta learner learns the mapping from Z to y. One can use a much simpler technique, such as linear regression, in this layer, because the heavy lifting of approximating the mapping between X and y is already encoded in the prediction set Z (see the from-scratch sketch after this list).
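To make the bookkeeping concrete, here is a bare-bones sketch of both layers, assuming numpy arrays and scikit-learn-style regressors. This is purely illustrative; MLens automates exactly this procedure (and much more) for us.

```python
import numpy as np
from sklearn.model_selection import KFold

def super_learn(X, y, base_learners, meta_learner, k=5):
    """Map (X, y) to (Z, y) via out-of-fold predictions, then fit the meta learner."""
    Z = np.zeros((len(y), len(base_learners)))
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        for j, learner in enumerate(base_learners):
            learner.fit(X[train_idx], y[train_idx])
            Z[val_idx, j] = learner.predict(X[val_idx])   # out-of-fold predictions only
    meta_learner.fit(Z, y)          # the meta learner learns the mapping Z -> y
    for learner in base_learners:   # refit each base learner on the full training set
        learner.fit(X, y)
    return base_learners, meta_learner

# At prediction time: stack the base-learner outputs and feed them to the meta learner, e.g.
# Z_new = np.column_stack([m.predict(X_new) for m in fitted_base]); y_hat = meta.predict(Z_new)
```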

Before getting our hands dirty with the super learner, we need to understand some caveats. Super learning (or ensembling in general) is not a silver bullet that enhances generalization accuracy in an agnostic fashion. In my experience, super learning squeezes out better generalization accuracy when the base learners are themselves competitive and each offers a different kind of utility. If the underlying base learners exhibit high bias or high variance, then boosting and bagging are better choices, respectively. At the same time, if one of the base learners stands out consistently across all k-fold splits, there is a very high chance that the super learner will not offer enough of a performance gain to justify the additional work of building the framework.

Enter ML-Ensemble

Let’s get our hand dirty with MLens. MLens (or ML-Ensemble) is a fantastic framework authored by Sebastian Flennerhag from DeepMind with a goal of creating complex super learners at scale. One of the best things about MLens is the amount of control it offers via a very high level API. It also has native support for timeseries cross-validation hence you do not need to worry about data leakage while working with timeseries prediction problems. I can keep fanboying about MLens all day, so let us put that aside and roll our sleeves to code.

We start by importing the necessary modules, assuming that mlens has been installed (if not, run pip3 install mlens).

Code snippet #1
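A minimal sketch of these imports, assuming the scikit-learn estimators and the two MLens ensemble classes used in the rest of the walkthrough:

```python
import numpy as np
import pandas as pd

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

from mlens.ensemble import SuperLearner, SequentialEnsemble
from mlens.metrics import make_scorer
```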

Next, we create a synthetic dataset with a healthy dose of non-linearity. To be specific, the target variable is a set of noisy observations produced by a combination of linear and non-linear functions. Plot twist: the dataset produced by make_regression corresponds to a straightforward linear equation, so to make things a little harder for the models, I have injected some arbitrary non-linearity into it (lines 9–11).

Code snippet #2
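A sketch of this step; the sample size and the specific non-linear terms below are arbitrary stand-ins rather than the exact ones used in the article:

```python
# Linear base signal from make_regression, then inject some arbitrary non-linearity
X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=42)

# Placeholder non-linear distortions of the target
y = y + 15 * np.sin(X[:, 0]) + 10 * np.abs(X[:, 1]) ** 1.5 + 5 * X[:, 2] * X[:, 3]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```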

The next step is to create the set of estimators that will act as the base learners and train them individually on the training set (X_train, y_train). We intentionally include base learners that employ completely different strategies to approximate the function (or learn the model). The goal is to capture the idiosyncrasies of the training set using a diverse set of estimators. We then use them for prediction on the unseen test set X_test. Note that no attempt at hyperparameter tuning was made, since this is for demonstration purposes only. We collect our predictions in the predictions dataframe (see line 22, code snippet #3).

Code snippet #3
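A sketch of that loop with the five base learners listed further below, all left at (mostly) default settings:

```python
# Five heterogeneous base learners, deliberately left untuned
base_learners = {
    "linreg": LinearRegression(),
    "knnreg": KNeighborsRegressor(),
    "svm": SVR(kernel="linear"),
    "svm-rbf": SVR(kernel="rbf"),
    "rf": RandomForestRegressor(random_state=42),
}

predictions = pd.DataFrame({"True value": y_test})
for name, est in base_learners.items():
    est.fit(X_train, y_train)
    predictions[name] = est.predict(X_test)   # collect each learner's test-set predictions
```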

We are all set to create our super learner. Below is a very basic demonstration of how to create a simple super learner with a base layer containing our base learners, followed by a meta learner. The first step is to choose an ensemble class and a scoring function for the k-fold cross-validation. MLens offers four ensemble classes, each catering to certain needs:

  • Super Learner
  • Subsemble
  • Blend Ensemble
  • Sequential Ensemble

In this article, we will focus on the Super Learner and the Sequential Ensemble classes. But it is worth noting the brilliance of the Subsemble class. Not all estimators scale politely as the dataset grows. Yes, kernel SVM, I am looking at you, because you have a notorious reputation for not scaling up, thanks to the underlying quadratic program. This is where Subsemble really shines. It partitions the dataset and fits the estimators independently on each partition, which opens up further opportunities for parallelization and concurrency. In this way, it can deliver performance on par with the other classes at a fraction of the training time, while enjoying the same theoretical guarantees as the super learner [3] under certain conditions.

The second step is to choose a scoring function. MLens offers a wide array of scoring functions, covering all the major categories such as Root Mean Squared Error (RMSE) and the F1 score. However, business needs might require quirkier scoring functions. For example, you might be forecasting demand for a certain inventory item where overestimating demand is penalized differently from underestimating it. In cases like these, we can create our own custom scoring function and wrap it using make_scorer (see lines 10–11) to ensure that it complies with the requirements of MLens regardless of what’s happening under the hood.

After instantiating the super learner, we add a list of base learners. In this example, we have the following estimators as base learners (see code snippet #3):

  • linreg (LinearRegression)
  • knnreg (K Nearest Neighbor Regression)
  • svm (Support Vector Machine)
  • svm-rbf (Kernel Support Vector Machine with Radial Basis Function)
  • rf (Random Forest)

In lines 16–17, we create the 0-th layer of base learners and the 1st layer, which is a meta learner based on simple linear regression. The rest follows the standard Scikit-Learn pipeline: fit, then predict (lines 22–23).

Code snippet #4
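A sketch of these steps, assuming MLens’s SuperLearner and make_scorer as documented; the custom RMSE function below stands in for whatever quirky metric your business needs:

```python
# A custom scoring function (plain RMSE), wrapped so it complies with MLens
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

rmse_scorer = make_scorer(rmse, greater_is_better=False)

# Layer 0: the heterogeneous base learners; meta layer: simple linear regression
sl = SuperLearner(scorer=rmse_scorer, folds=5, random_state=42)
sl.add(list(base_learners.values()))
sl.add_meta(LinearRegression())

sl.fit(X_train, y_train)
predictions["super learner"] = sl.predict(X_test)
```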

Using the Sequential Ensemble class, we can introduce intermediate layers between the base learner layer and the meta learner layer. In this example, I have inserted a new layer (line 11 in code snippet #5) containing a linear regression and a support vector machine.

Code snippet #5
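A sketch of the 2-layer variant, assuming the 'stack' layer type of MLens’s SequentialEnsemble and reusing the scorer defined above:

```python
# A 2-layer super learner: base layer, one intermediate stacked layer, then the meta learner
seq = SequentialEnsemble(scorer=rmse_scorer, random_state=42)
seq.add('stack', list(base_learners.values()))                 # layer 0: base learners
seq.add('stack', [LinearRegression(), SVR(kernel="linear")])   # intermediate layer
seq.add_meta(LinearRegression())

seq.fit(X_train, y_train)
predictions["2-layer super learner"] = seq.predict(X_test)
```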

Now, let us have a look at the RMSE performance of the base learners and the super learners, as shown in code snippet #6. We can see that our experiment contains a set of quite competitive learners: linear regression, SVM, and random forest all report compelling performance. K Nearest Neighbor regression (knnreg) was less appealing, while kernel SVM failed catastrophically. But the clear winners here are the super learners. In particular, the 2-layer super learner yielded almost a 6% improvement over the best base learner (random forest).

Code snippet #6
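Continuing the sketch above, the comparison can be as simple as scoring every column of the predictions dataframe against the held-out labels:

```python
# RMSE of every base learner and both super learners on the held-out test set
scores = {
    name: rmse(y_test, predictions[name])
    for name in predictions.columns
    if name != "True value"
}
print(pd.Series(scores).sort_values())
```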

Here is an interactive plot of the performance of the base learners and the super learners against the true values y_test (denoted as the True value). You can toggle each variable to get an in-depth look at how each of them performed and how the super learner combined the strengths of each base learner to maximize generalization accuracy.

Introducing custom estimators in the super learner

Let’s say, you have got some domain expertise. And those expertise are not captured by the off-the-shelf estimators offered by Scikit Learn. So you rolled your sleeve and wrote you own estimator that does some fancy math under the hood. You would like to include your custom estimator in the super learner. How to do it?

The trick is to make your estimator Scikit-Learn compliant. It is a healthy practice anyway: it lets you treat your estimator as a Scikit-Learn estimator locally, which in turn unlocks a host of tools in the Scikit-Learn ecosystem. For example, instead of writing a chunky snippet, you can directly use GridSearchCV for automated parameter tuning of your custom estimator. You can also include your model in a Pipeline and expect it to behave properly.

Below is a demo of how we can inherit from the BaseEstimator class to make a custom estimator Scikit-Learn compliant. The class has two methods: fit and predict. The fit method contains the custom estimation logic and other heavy lifting. The check_X_y utility checks whether the dimensions of the supplied numpy arrays are appropriate, and a fitted flag (such as is_fitted_, with the trailing underscore Scikit-Learn uses for fitted attributes) prevents calling predict before the model has been fitted. One notable quirk is that the fit method has to return self.
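A minimal sketch of such a class; the math inside fit is a trivial placeholder for your own estimator, and the class name is made up for illustration:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted


class MyCustomRegressor(BaseEstimator, RegressorMixin):
    """Skeleton for a Scikit-Learn compliant custom estimator."""

    def __init__(self, some_hyperparameter=1.0):
        # only store hyperparameters here; no validation or fitting logic in __init__
        self.some_hyperparameter = some_hyperparameter

    def fit(self, X, y):
        X, y = check_X_y(X, y)            # validate shapes and dtypes
        # --- your fancy math goes here; a trivial placeholder follows ---
        self.coef_ = np.linalg.pinv(X) @ y * self.some_hyperparameter
        self.is_fitted_ = True            # flag checked before predicting
        return self                       # fit must return self

    def predict(self, X):
        check_is_fitted(self, "is_fitted_")
        X = check_array(X)
        return X @ self.coef_
```

Once it behaves like this, an instance of the class can be dropped into the super learner’s add call just like any built-in Scikit-Learn estimator.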

Conclusion

Super learners are not a sure-fire way to maximize generalization accuracy, but they do so with high probability. Especially when you have an array of competitive estimators, the super learner can often squeeze a little more performance out of them. MLens lets us carry out rapid experimentation with super learning, and at scale. I hope this article has equipped you with some understanding of how to approach super learning.

Further reading, if you are interested:

[1] Jensen’s Inequality

[2] Stacked Generalization

[3] Super Learning

[4] Subsembling
