Modeling Ensembles with R and {caret}
Practical walkthroughs on machine learning, data exploration and finding insight.
Resources
Packages Used in this Walkthrough
- {caret} - modeling wrapper, functions, commands
- {RCurl} - web data downloading functions
- {pROC} - Area Under the Curve (AUC) functions
There are many reasons to ensemble models, but it usually comes down to capturing a deeper understanding of high-dimensional data. The more complex a data set, the more it benefits from additional models, like additional pairs of eyes, each capturing nuances that a single model would miss.
Let’s code!
This walkthrough leverages the caret package for ease of coding, but the concept applies to any model in any statistical programming language. caret lets you switch models in a script without changing much of the code; you could easily write a loop that runs through the nearly 170 models the package currently supports (Max Kuhn keeps adding new ones) by changing a single variable.
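Here is a minimal sketch of that idea (not run in this walkthrough; x, y, and ctrl are placeholders for your predictors, outcome, and a trainControl object):
# a sketch: loop over caret method names; 'method' is the only thing that changes
for (m in c('glm', 'rpart', 'knn')) {
    fit <- try(train(x, y, method = m, trControl = ctrl))
    if (!inherits(fit, 'try-error')) print(fit$method)
}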
To get a complete list of the models supported by caret, use the getModelInfo function:
library(caret)
names(getModelInfo())
## [1] "ada" "ANFIS" "avNNet"
## [4] "bag" "bagEarth" "bagFDA"
## [7] "bayesglm" "bdk" "blackboost"
## [10] "Boruta" "brnn" "bstLs"
## [13] "bstSm" "bstTree" "C5.0"
## [16] "C5.0Cost" "C5.0Rules" "C5.0Tree"
## [19] "cforest" "CSimca" "ctree"
## [22] "ctree2" "cubist" "DENFIS"
## [25] "dnn" "earth" "elm"
## [28] "enet" "evtree" "extraTrees"
## [31] "fda" "FH.GBML" "FIR.DM"
## [34] "foba" "FRBCS.CHI" "FRBCS.W"
## [37] "FS.HGD" "gam" "gamboost"
## [40] "gamLoess" "gamSpline" "gaussprLinear"
## [43] "gaussprPoly" "gaussprRadial" "gbm"
## [46] "gcvEarth" "GFS.FR.MOGAL" "GFS.GCCL"
## [49] "GFS.LT.RS" "GFS.Thrift" "glm"
## [52] "glmboost" "glmnet" "glmStepAIC"
## [55] "gpls" "hda" "hdda"
## [58] "HYFIS" "icr" "J48"
## [61] "JRip" "kernelpls" "kknn"
## [64] "knn" "krlsPoly" "krlsRadial"
## [67] "lars" "lars2" "lasso"
## [70] "lda" "lda2" "leapBackward"
## [73] "leapForward" "leapSeq" "Linda"
## [76] "lm" "lmStepAIC" "LMT"
## [79] "logicBag" "LogitBoost" "logreg"
## [82] "lssvmLinear" "lssvmPoly" "lssvmRadial"
## [85] "lvq" "M5" "M5Rules"
## [88] "mda" "Mlda" "mlp"
## [91] "mlpWeightDecay" "multinom" "nb"
## [94] "neuralnet" "nnet" "nodeHarvest"
## [97] "oblique.tree" "OneR" "ORFlog"
## [100] "ORFpls" "ORFridge" "ORFsvm"
## [103] "pam" "parRF" "PART"
## [106] "partDSA" "pcaNNet" "pcr"
## [109] "pda" "pda2" "penalized"
## [112] "PenalizedLDA" "plr" "pls"
## [115] "plsRglm" "ppr" "protoclass"
## [118] "qda" "QdaCov" "qrf"
## [121] "qrnn" "rbf" "rbfDDA"
## [124] "rda" "relaxo" "rf"
## [127] "rFerns" "RFlda" "ridge"
## [130] "rknn" "rknnBel" "rlm"
## [133] "rocc" "rpart" "rpart2"
## [136] "rpartCost" "RRF" "RRFglobal"
## [139] "rrlda" "RSimca" "rvmLinear"
## [142] "rvmPoly" "rvmRadial" "SBC"
## [145] "sda" "sddaLDA" "sddaQDA"
## [148] "simpls" "SLAVE" "slda"
## [151] "smda" "sparseLDA" "spls"
## [154] "stepLDA" "stepQDA" "superpc"
## [157] "svmBoundrangeString" "svmExpoString" "svmLinear"
## [160] "svmPoly" "svmRadial" "svmRadialCost"
## [163] "svmRadialWeights" "svmSpectrumString" "treebag"
## [166] "vbmpRadial" "widekernelpls" "WM"
## [169] "xyf"
As you can see, there are plenty of models available to satisfy most needs. Some support both classification and regression, while others handle only one of the two. You can check which types a model supports using the same getModelInfo function:
getModelInfo()$glm$type
# "Regression" "Classification"
In the snippet above, glm supports both regression and classification.
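For instance, a quick sketch that uses this field to list only the models supporting classification:
# a sketch: filter the model list down to classifiers
info <- getModelInfo()
classifiers <- names(info)[sapply(info, function(m) 'Classification' %in% m$type)]
head(classifiers)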
We download the vehicles data set from Hadley Wickham hosted on Github. To keep this simple, we attempt to predict whether a vehicle has 6 cylinders using only the first 24 columns of the data set:
library(RCurl)
urlfile <-'https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv'
x <- getURL(urlfile, ssl.verifypeer = FALSE)
vehicles <- read.csv(textConnection(x))
# alternative way of getting the data if the above snippet doesn't work:
# urlData <- getURL('https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv')
# vehicles <- read.csv(text = urlData)
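If neither approach works, recent versions of R can usually read an https URL directly (a sketch; this assumes your R build has https support):
# a sketch: modern R can often fetch https URLs without RCurl
# vehicles <- read.csv(urlfile)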
We clean the outcome variable cylinders by assigning it 1 for 6 cylinders and 0 for everything else:
# keep only the first 24 columns
vehicles <- vehicles[names(vehicles)[1:24]]
# coerce every column to numeric (via character) and zero out NAs
vehicles <- data.frame(lapply(vehicles, as.character), stringsAsFactors=FALSE)
vehicles <- data.frame(lapply(vehicles, as.numeric))
vehicles[is.na(vehicles)] <- 0
# binary outcome: 1 for 6-cylinder vehicles, 0 for everything else
vehicles$cylinders <- ifelse(vehicles$cylinders == 6, 1, 0)
We call prop.table to understand the proportions of our outcome variable:
prop.table(table(vehicles$cylinders))
## 0 1
## 0.6506 0.3494
This tells us that roughly 35% of the vehicles have 6 cylinders. The data isn't perfectly balanced, but it certainly isn't skewed enough to count as a rare-event problem.
Here is the one complicated part: instead of splitting the data into the usual two parts, train and test, we split it into three parts: ensembleData, blenderData, and testingData:
set.seed(1234)
vehicles <- vehicles[sample(nrow(vehicles)),]
split <- floor(nrow(vehicles)/3)
ensembleData <- vehicles[1:split,]
blenderData <- vehicles[(split+1):(split*2),]
testingData <- vehicles[(split*2+1):nrow(vehicles),]
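A quick sanity check (a sketch) confirms that the three partitions together cover the whole data set:
# the three partition sizes should sum to the full row count
nrow(ensembleData) + nrow(blenderData) + nrow(testingData) == nrow(vehicles)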
We assign the outcome name to labelName and the predictor variables to predictors:
labelName <- 'cylinders'
predictors <- names(ensembleData)[names(ensembleData) != labelName]
We create a caret trainControl object to set the number of cross-validation folds (more is generally better, but for brevity we use only 3):
myControl <- trainControl(method='cv', number=3, returnResamp='none')
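If runtime isn't a concern, repeated cross-validation gives a more stable estimate; a sketch (slowControl is a hypothetical alternative you could pass to train instead of myControl):
# a sketch: repeated 10-fold CV - slower but more stable than 3 folds
slowControl <- trainControl(method='repeatedcv', number=10, repeats=3, returnResamp='none')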
Benchmark Model
We run the data through a gbm model without any ensembling to use as a comparative benchmark:
test_model <- train(blenderData[,predictors], blenderData[,labelName], method='gbm', trControl=myControl)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.2147 nan 0.1000 0.0128
## 2 0.2044 nan 0.1000 0.0104
## 3 0.1962 nan 0.1000 0.0084
...
You'll see a series of lines like those above as caret trains the gbm model.
We then use the model to predict 6-cylinder vehicles on the testingData set and call pROC's roc function to get the Area Under the Curve (AUC):
preds <- predict(object=test_model, testingData[,predictors])
library(pROC)
auc <- roc(testingData[,labelName], preds)
print(auc$auc)
## Area under the curve: 0.99
That's a fairly strong AUC of 0.99 (remember that 0.5 is random and 1 is perfect). It's hard to believe we can improve on this score using an ensemble of models…
Ensembles
But we're going to try. We now use three models, gbm, rpart, and treebag, as the base of our ensemble and train each of them on the ensembleData set:
model_gbm <- train(ensembleData[,predictors], ensembleData[,labelName], method='gbm', trControl=myControl)
model_rpart <- train(ensembleData[,predictors], ensembleData[,labelName], method='rpart', trControl=myControl)
model_treebag <- train(ensembleData[,predictors], ensembleData[,labelName], method='treebag', trControl=myControl)
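Since these three calls differ only in the method argument, you could equivalently train them in a loop, echoing the earlier point about caret; a sketch:
# a sketch: the same three fits via lapply; only 'method' changes
methods <- c('gbm', 'rpart', 'treebag')
models <- lapply(methods, function(m)
    train(ensembleData[,predictors], ensembleData[,labelName], method=m, trControl=myControl))
names(models) <- methods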
After our three models are trained, we use them to predict 6-cylinder vehicles on the other two data sets: blenderData and testingData. Yes, both! We need to harvest the predictions from both data sets because we're going to add those predictions back as new features to the same data sets. Since we have three models, we add three new columns to both blenderData and testingData:
blenderData$gbm_PROB <- predict(object=model_gbm, blenderData[,predictors])
blenderData$rpart_PROB <- predict(object=model_rpart, blenderData[,predictors])
blenderData$treebag_PROB <- predict(object=model_treebag, blenderData[,predictors])
testingData$gbm_PROB <- predict(object=model_gbm, testingData[,predictors])
testingData$rpart_PROB <- predict(object=model_rpart, testingData[,predictors])
testingData$treebag_PROB <- predict(object=model_treebag, testingData[,predictors])
Note how easy it is to add those predictions back to the original data sets (follow the assignments above).
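Before blending, it's worth checking how each base model fares on its own against testingData (this check also appears in the full source at the end):
# AUC of each individual base model on the testing set
auc <- roc(testingData[,labelName], testingData$gbm_PROB)
print(auc$auc) # Area under the curve: 0.9893
auc <- roc(testingData[,labelName], testingData$rpart_PROB)
print(auc$auc) # Area under the curve: 0.958
auc <- roc(testingData[,labelName], testingData$treebag_PROB)
print(auc$auc) # Area under the curve: 0.9734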
Now we train a final blending model on the original features plus the new prediction columns (we use gbm here, but the choice is arbitrary):
predictors <- names(blenderData)[names(blenderData) != labelName]
final_blender_model <- train(blenderData[,predictors], blenderData[,labelName], method='gbm', trControl=myControl)
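Note that predictors was rebuilt just above, so the blender now sees the three base models' predictions as extra features alongside the originals; a quick check (a sketch):
# the last three columns are the harvested predictions
tail(names(blenderData), 3) # "gbm_PROB" "rpart_PROB" "treebag_PROB"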
And we call the predict and roc/auc functions to see how our blended ensemble model fared:
preds <- predict(object=final_blender_model, testingData[,predictors])
auc <- roc(testingData[,labelName], preds)
print(auc$auc)
## Area under the curve: 0.993
There you have it: an AUC bump of about 0.003. That may not seem like much (and there are many ways to improve the score further), but it could easily give you the extra edge in your next data science competition!
Full source code (also on GitHub):
library(caret)
names(getModelInfo())
# Load data from Hadley Wickham on Github - Vehicle data set and predict 6 cylinder vehicles
library(RCurl)
#urlData <- getURL('https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv')
#vehicles <- read.csv(text = urlData)
# alternative way of getting the data
urlfile <-'https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv'
x <- getURL(urlfile, ssl.verifypeer = FALSE)
vehicles <- read.csv(textConnection(x))
# clean up the data and only use the first 24 columns
vehicles <- vehicles[names(vehicles)[1:24]]
vehicles <- data.frame(lapply(vehicles, as.character), stringsAsFactors=FALSE)
vehicles <- data.frame(lapply(vehicles, as.numeric))
vehicles[is.na(vehicles)] <- 0
vehicles$cylinders <- ifelse(vehicles$cylinders == 6, 1,0)
prop.table(table(vehicles$cylinders))
# shuffle and split the data into three parts
set.seed(1234)
vehicles <- vehicles[sample(nrow(vehicles)),]
split <- floor(nrow(vehicles)/3)
ensembleData <- vehicles[1:split,]
blenderData <- vehicles[(split+1):(split*2),]
testingData <- vehicles[(split*2+1):nrow(vehicles),]
# set label name and predictors
labelName <- 'cylinders'
predictors <- names(ensembleData)[names(ensembleData) != labelName]
# create a caret control object to control the number of cross-validations performed
myControl <- trainControl(method='cv', number=3, returnResamp='none')
# quick benchmark model
test_model <- train(blenderData[,predictors], blenderData[,labelName], method='gbm', trControl=myControl)
preds <- predict(object=test_model, testingData[,predictors])
library(pROC)
auc <- roc(testingData[,labelName], preds)
print(auc$auc) # Area under the curve: 0.9896
# train all the ensemble models with ensembleData
model_gbm <- train(ensembleData[,predictors], ensembleData[,labelName], method='gbm', trControl=myControl)
model_rpart <- train(ensembleData[,predictors], ensembleData[,labelName], method='rpart', trControl=myControl)
model_treebag <- train(ensembleData[,predictors], ensembleData[,labelName], method='treebag', trControl=myControl)
# get predictions for each ensemble model for two last data sets
# and add them back to themselves
blenderData$gbm_PROB <- predict(object=model_gbm, blenderData[,predictors])
blenderData$rpart_PROB <- predict(object=model_rpart, blenderData[,predictors])
blenderData$treebag_PROB <- predict(object=model_treebag, blenderData[,predictors])
testingData$gbm_PROB <- predict(object=model_gbm, testingData[,predictors])
testingData$rpart_PROB <- predict(object=model_rpart, testingData[,predictors])
testingData$treebag_PROB <- predict(object=model_treebag, testingData[,predictors])
# see how each individual model performed on its own
auc <- roc(testingData[,labelName], testingData$gbm_PROB)
print(auc$auc) # Area under the curve: 0.9893
auc <- roc(testingData[,labelName], testingData$rpart_PROB)
print(auc$auc) # Area under the curve: 0.958
auc <- roc(testingData[,labelName], testingData$treebag_PROB)
print(auc$auc) # Area under the curve: 0.9734
# run a final model to blend all the probabilities together
predictors <- names(blenderData)[names(blenderData) != labelName]
final_blender_model <- train(blenderData[,predictors], blenderData[,labelName], method='gbm', trControl=myControl)
# See final prediction and AUC of blended ensemble
preds <- predict(object=final_blender_model, testingData[,predictors])
auc <- roc(testingData[,labelName], preds)
print(auc$auc) # Area under the curve: 0.9922
Manuel Amunategui - Follow me on Twitter: @amunategui