Summary
I looked to Kaggle to further practice building predictive models. After optimizing single and ensemble regression techniques, I discovered ensemble stacking as a method for building a strong predictive model from a collection of weaker learners. The outcome is a drastic improvement in predictive accuracy. This post will provide an overview of:
- the basics of automating data preparation using caret
- building stacked ensemble models using caretEnsemble
- reasoning through how the various models that I used improve the ensemble predictions
This post will deal less with the specifics of the dataset and will instead provide an overview of how these packages provide easy, flexible, and powerful methods for developing strong predictive models.
Using caret to accelerate data processing and feature selection
The caret package is great for automating data pre-processing, feature selection, and tuning machine learning algorithms. It only takes one glance at the dimensions of the dataset in Kaggle’s Santander Value Prediction Challenge, which starts with over 4000 variables, to realize that tools are needed to quickly identify useless variables, especially those with little or no variance, and to normalize highly skewed variables. caret can handle these repetitive actions through the preProcess function. In the code below, caret evaluates all the variables for near-zero variance (“nzv”) and performs the Box-Cox and Yeo-Johnson power transformations to normalize skewed data.
First, we will do a little work to impute missing values. Another Kaggler has done a thorough job of determining good ways to handle missing data in this challenge, so I have copied the method: imputing “none” in some character vectors, zero for some numeric values, and the mode of all non-NA values for other numeric values.
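The three imputation rules can be sketched in base R. This is only an illustration on a made-up data frame, not the other Kaggler's actual code: the column names and the `impute_mode_numeric` helper are hypothetical.

```r
# Toy data frame standing in for the competition data (hypothetical columns)
toy <- data.frame(
  garage_type = c("attached", NA, "detached", NA),
  pool_area   = c(NA, 0, 512, NA),
  n_bathrooms = c(2, 1, NA, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NAs in a numeric vector with the most
# frequent non-NA value
impute_mode_numeric <- function(x) {
  counts <- table(x[!is.na(x)])
  mode_value <- as.numeric(names(counts)[which.max(counts)])
  x[is.na(x)] <- mode_value
  x
}

# "none" where a missing character value means the feature is absent
toy$garage_type[is.na(toy$garage_type)] <- "none"
# zero where a missing numeric value means there is nothing to measure
toy$pool_area[is.na(toy$pool_area)] <- 0
# mode of the observed values for the remaining numeric column
toy$n_bathrooms <- impute_mode_numeric(toy$n_bathrooms)

anyNA(toy)
```

Which rule suits which column is a judgment call about what a missing value means in that column, so the assignment above is an assumption made for the toy data.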
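# dimensions before pre-processing
dim(all.fixmissing)
train.predict <-
  all.fixmissing %>%
  data.frame() %>%
  # estimate the transformations with preProcess(), then apply them with predict()
  predict(preProcess(., method = c("nzv", "BoxCox", "YeoJohnson")), .)
# dimensions after pre-processing
dim(train.predict)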
We can see that caret has cut down on the number of variables because the near-zero variance (nzv) method discards any variables that show little or no variance and are therefore of little relevance for modelling.
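To see which variables the filter would flag, and why, caret's `nearZeroVar` function can report its diagnostics directly. The sketch below uses a made-up three-column data frame (the column names are hypothetical) and the function's default cutoffs.

```r
library(caret)

set.seed(42)
toy <- data.frame(
  constant    = rep(1, 100),           # a single value: zero variance
  almost_flat = c(rep(0, 98), 1, 2),   # one value dominates
  informative = rnorm(100)             # varies freely
)

# saveMetrics = TRUE returns the diagnostics instead of column indices
nzv_metrics <- nearZeroVar(toy, saveMetrics = TRUE)
nzv_metrics[, c("freqRatio", "percentUnique", "zeroVar", "nzv")]

# Names of the columns flagged for removal
flagged <- rownames(nzv_metrics)[nzv_metrics$nzv]
flagged
```

A column is flagged when its most common value heavily outnumbers the second most common one and it has few distinct values overall, or when it has only one value.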
Modeling
Basics of tuning a model with caret
You can tune the parameters of any model included in caret:
# evaluation control: 10-fold cross-validation, repeated once
control <- trainControl(method = "repeatedcv", number = 10, repeats = 1,
                        savePredictions = "final",
                        classProbs = FALSE)
mod <- caret::train(y_train ~ .,
                    data = train.df,
                    trControl = control,
                    method = 'glm')
mod
trainControl dictates how the model will be evaluated. Here we specify repeated cross-validation (method = “repeatedcv”), meaning k-fold cross-validation is performed multiple times (with repeats = 1, as here, it runs once). k-fold cross-validation involves splitting the data into k groups of roughly equal size, repeatedly training the model on all but one of these groups, and then testing it on the one group that was left out. The test errors from the k models are then averaged to estimate the test error, which caret uses to select the final model parameters.
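To make the procedure concrete, here is a minimal hand-rolled version of k-fold cross-validation in base R. It is a sketch of the idea rather than what caret does internally, and it uses the built-in mtcars data with an arbitrary linear model as a stand-in for the competition data.

```r
set.seed(123)
k <- 10

# Randomly assign each row to one of k folds of roughly equal size
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

fold_rmse <- sapply(seq_len(k), function(i) {
  train_rows <- mtcars[fold_id != i, ]
  test_rows  <- mtcars[fold_id == i, ]
  fit  <- lm(mpg ~ wt + hp, data = train_rows)   # train on the other k - 1 folds
  pred <- predict(fit, newdata = test_rows)      # predict the held-out fold
  sqrt(mean((test_rows$mpg - pred)^2))           # RMSE on the held-out fold
})

# The cross-validated error is the average over the k held-out folds
mean(fold_rmse)
```

Repeated cross-validation simply reruns this with a fresh random fold assignment each time and averages across all runs.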
Improving predictive accuracy with stacked ensemble models
Fundamental concept of stacked ensemble models
Stacked ensemble models have proven quite successful in winning Kaggle competitions because they leverage the unique strengths of various ML algorithms to build a stronger model. The basic idea is that multiple models are built on the training data (depicted by the three y-hat objects in the figure below, likely a decision tree, SVM, and neural network here). The resulting predicted values from these models, known as the base learner models, are used as the input to a second-level algorithm, often called the meta-model. The meta-model will produce a model that optimizes the predicted values from the base learners, providing a single stronger model because each base learner will often pick up on different patterns in the data.
<img src="/img/blogs/modelstacking.png" style="display: block; margin: 0 auto; background-color:white;">
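The mechanics of stacking can be sketched by hand in base R. The example below is a simplified illustration under my own assumptions, not the caretEnsemble implementation: it uses mtcars, two deliberately different linear base learners, and a linear meta-model fit on their out-of-fold predictions.

```r
set.seed(1)
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Out-of-fold predictions from two base learners that use different features
oof <- data.frame(mpg    = mtcars$mpg,
                  linear = NA_real_,
                  curved = NA_real_)
for (i in seq_len(k)) {
  held_out   <- fold_id == i
  fit_linear <- lm(mpg ~ wt, data = mtcars[!held_out, ])
  fit_curved <- lm(mpg ~ poly(hp, 2), data = mtcars[!held_out, ])
  oof$linear[held_out] <- predict(fit_linear, newdata = mtcars[held_out, ])
  oof$curved[held_out] <- predict(fit_curved, newdata = mtcars[held_out, ])
}

# Meta-model: learn how to weight the base learners' predictions
meta <- lm(mpg ~ linear + curved, data = oof)
coef(meta)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
c(linear  = rmse(oof$mpg, oof$linear),
  curved  = rmse(oof$mpg, oof$curved),
  stacked = rmse(oof$mpg, fitted(meta)))
```

The stacked RMSE here is measured on the same rows the meta-model was fit to, so it is optimistic. A fair comparison would evaluate the meta-model with its own cross-validation, which is what the trControl passed to caretStack is for.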
Implementation in caretEnsemble
The caretEnsemble package streamlines the process of building and evaluating stacked ensemble models.
To implement a stacked ensemble model, we first bundle the trained base learners using caretList. Again, trControl determines how each model will be evaluated. It is important to specify the resampling index for the base learners so that all the models train on the same data partitions. We specify the individual base learners in two ways. First, any base learners that do not allow for parameter tuning are provided as a vector to the methodList parameter of caretList. Second, models that allow for parameter tuning in caret are listed individually in the tuneList parameter. For each of these models, we first tune the parameters as we saw above, and then supply the best set of parameters as a data frame to the tuneGrid parameter of the caretModelSpec function. Specifying single values for the model parameters speeds up computation because caretEnsemble will not attempt to re-tune the models.
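A small caretList call along these lines might look like the sketch below. It is not the model list used in this post: the data (mtcars), the choice of base learners, and the parameter values are placeholders picked so the example runs quickly, and argument details may differ between caretEnsemble versions.

```r
library(caret)
library(caretEnsemble)

set.seed(2018)
# Fix the resampling indexes so every base learner sees the same partitions
folds <- createFolds(mtcars$mpg, k = 5, returnTrain = TRUE)
control <- trainControl(method = "cv", number = 5, index = folds,
                        savePredictions = "final")

models <- caretList(
  mpg ~ ., data = mtcars,
  trControl = control,
  metric = "RMSE",
  # learners with nothing to tune go in methodList
  methodList = c("glm"),
  # learners with parameters go in tuneList, each pinned to one value
  # (placeholder values here, standing in for previously tuned ones)
  tuneList = list(
    knn   = caretModelSpec(method = "knn",   tuneGrid = data.frame(k = 5)),
    rpart = caretModelSpec(method = "rpart", tuneGrid = data.frame(cp = 0.01))
  )
)
names(models)
```

Because each tuneGrid has a single row, every model is fit once per fold and no grid search takes place.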
<img src="/img/blogs/ensemblestack_metrics.jpeg" style="display: block; margin: 0 auto; background-color:white;">
For the meta-model, we again specify how the model will be evaluated, here stored in the stackControl object. We pass the meta-model as the method argument to the caretStack function, here a generalized linear model (glm), and the best model is selected based on root mean squared error (RMSE). The final model demonstrates a significant improvement in RMSE and adjusted R-squared.
# evaluation control
stackControl <- trainControl(method = "repeatedcv", number = 5, repeats = 5,
                             savePredictions = "final",
                             classProbs = FALSE)
# train the meta-model
stack.glm <- caretStack(models, method = 'glm',
                        metric = "RMSE", trControl = stackControl)
stack.glm
We can then predict values for the test data. The predict_values function is a helper I wrote to automatically write the predicted values to a CSV file for submission.
predict_values <- function(model, test_data) {
  # reverse the log transformation of the target (expm1 inverts log1p)
  predicted <- predict(model, test_data) %>% expm1()
  # test_id is expected to exist in the calling environment
  predicted.df <- cbind(test_id, predicted) %>%
    magrittr::set_colnames(c("Id", "SalePrice"))
  # write.table(predicted.df, file = paste0("./outputs/", Sys.time(), "-predictedvalues", ".csv"),
  #             sep = ",", col.names = T, row.names = F)
  predicted.df
}
predict_values(stack.glm, test.df)
We will get a warning telling us that “prediction from a rank-deficient fit may be misleading” when the base learners are highly correlated, meaning they are picking up on the same trends in the data and adding little improvement in predictive accuracy. We can explore which of these base learners are highly correlated through a correlation matrix.
<img src="/img/blogs/table1.jpeg" style="display: block; margin: 0 auto; background-color:white;">
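The link between redundant base learners and a rank-deficient meta-model can be reproduced with simulated data. The sketch below is a deliberately extreme, made-up case: the third "learner" is an exact linear combination of the other two, so the glm meta-model has no way to estimate a separate weight for it. The learner names and data are hypothetical.

```r
set.seed(7)
n <- 50
y <- rnorm(n)

# Simulated base-learner predictions; learner_c adds no new information
preds <- data.frame(
  learner_a = y + rnorm(n, sd = 0.5),
  learner_b = y + rnorm(n, sd = 0.5)
)
preds$learner_c <- 0.5 * preds$learner_a + 0.5 * preds$learner_b

# Pairwise correlations between the base learners' predictions
round(cor(preds), 2)

# Meta-model on the stacked predictions
meta <- glm(y ~ ., data = data.frame(y = y, preds))
coef(meta)  # the weight for learner_c cannot be estimated and is NA
```

For a real caretList, `modelCor(resamples(models))` from caret gives a comparable matrix, with the caveat that it correlates the resampled performance of the models rather than their raw predictions.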
Understanding the advantages of specific ML algorithms
When trying to improve the ensemble's predictive accuracy, it is easy to throw in many base learners that don’t provide large improvements in accuracy. caretEnsemble tries to warn you when your models are highly correlated and will provide only minor improvements. These minor improvements might matter if you are trying to score higher on Kaggle, where small gains in the evaluation metric can move you up the leaderboard, but large improvements result from combining base learners that pick up on different patterns in the data.