# Modeling Ensembles with R and {caret}

Practical walkthroughs on machine learning, data exploration and finding insight.

**Resources**

**Packages Used in this Walkthrough**

**{caret}**- modeling wrapper, functions, commands**{RCurl}**- web data downloading functions**{pROC}**- Area Under the Curve (AUC) functions

There are many reasons to ensemble models but it usually comes down to capturing a deeper understanding of high dimensionality data. The more complex a data set, the more it benefits from additional models, just like additional eyes, to capture more nuances scattered around high dimensionality data.

**Let’s code!**

This walkthrough leverages the **caret** package for ease of coding but the concept applies to any model in any statistical programming language. **Caret** allows you to easily switch models in a script without having to change much of the code. You can easily write a loop and have it run through the almost 170 models that the package currently supports (Max Kuhn keeps adding new ones) by only changing one variable.

To get a complete list of the models supported by **caret** use the `getModelInfo`

function:

```
library(caret)
names(getModelInfo())
```

```
## [1] "ada" "ANFIS" "avNNet"
## [4] "bag" "bagEarth" "bagFDA"
## [7] "bayesglm" "bdk" "blackboost"
## [10] "Boruta" "brnn" "bstLs"
## [13] "bstSm" "bstTree" "C5.0"
## [16] "C5.0Cost" "C5.0Rules" "C5.0Tree"
## [19] "cforest" "CSimca" "ctree"
## [22] "ctree2" "cubist" "DENFIS"
## [25] "dnn" "earth" "elm"
## [28] "enet" "evtree" "extraTrees"
## [31] "fda" "FH.GBML" "FIR.DM"
## [34] "foba" "FRBCS.CHI" "FRBCS.W"
## [37] "FS.HGD" "gam" "gamboost"
## [40] "gamLoess" "gamSpline" "gaussprLinear"
## [43] "gaussprPoly" "gaussprRadial" "gbm"
## [46] "gcvEarth" "GFS.FR.MOGAL" "GFS.GCCL"
## [49] "GFS.LT.RS" "GFS.Thrift" "glm"
## [52] "glmboost" "glmnet" "glmStepAIC"
## [55] "gpls" "hda" "hdda"
## [58] "HYFIS" "icr" "J48"
## [61] "JRip" "kernelpls" "kknn"
## [64] "knn" "krlsPoly" "krlsRadial"
## [67] "lars" "lars2" "lasso"
## [70] "lda" "lda2" "leapBackward"
## [73] "leapForward" "leapSeq" "Linda"
## [76] "lm" "lmStepAIC" "LMT"
## [79] "logicBag" "LogitBoost" "logreg"
## [82] "lssvmLinear" "lssvmPoly" "lssvmRadial"
## [85] "lvq" "M5" "M5Rules"
## [88] "mda" "Mlda" "mlp"
## [91] "mlpWeightDecay" "multinom" "nb"
## [94] "neuralnet" "nnet" "nodeHarvest"
## [97] "oblique.tree" "OneR" "ORFlog"
## [100] "ORFpls" "ORFridge" "ORFsvm"
## [103] "pam" "parRF" "PART"
## [106] "partDSA" "pcaNNet" "pcr"
## [109] "pda" "pda2" "penalized"
## [112] "PenalizedLDA" "plr" "pls"
## [115] "plsRglm" "ppr" "protoclass"
## [118] "qda" "QdaCov" "qrf"
## [121] "qrnn" "rbf" "rbfDDA"
## [124] "rda" "relaxo" "rf"
## [127] "rFerns" "RFlda" "ridge"
## [130] "rknn" "rknnBel" "rlm"
## [133] "rocc" "rpart" "rpart2"
## [136] "rpartCost" "RRF" "RRFglobal"
## [139] "rrlda" "RSimca" "rvmLinear"
## [142] "rvmPoly" "rvmRadial" "SBC"
## [145] "sda" "sddaLDA" "sddaQDA"
## [148] "simpls" "SLAVE" "slda"
## [151] "smda" "sparseLDA" "spls"
## [154] "stepLDA" "stepQDA" "superpc"
## [157] "svmBoundrangeString" "svmExpoString" "svmLinear"
## [160] "svmPoly" "svmRadial" "svmRadialCost"
## [163] "svmRadialWeights" "svmSpectrumString" "treebag"
## [166] "vbmpRadial" "widekernelpls" "WM"
## [169] "xyf"
```

As you can see, there are plenty of models available to satisfy most needs. Some support either **dual use**, while others are either **classification** or **regression** only. You can test for the type a model supports by using the same `getModelInfo`

function:

```
getModelInfo()$glm$type
```

```
# "Regression" "Classification"
```

In the above snippet, `glm`

supports both **regression** and **classification**.

We download the **vehicles** data set from Hadley Wickham hosted on Github. To keep this simple, we attempt to predict whether a vehicle has 6 cylinders using only the first 24 columns of the data set:

```
library(RCurl)
urlfile <-'https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv'
x <- getURL(urlfile, ssl.verifypeer = FALSE)
vehicles <- read.csv(textConnection(x))
# alternative way of getting the data if the above snippet doesn't work:
# urlData <- getURL('https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv')
# vehicles <- read.csv(text = urlData)
```

We clean the outcome variable `cyclinders`

by assigning it `1`

for 6 cyclinders and `0`

for everything else:

```
vehicles <- vehicles[names(vehicles)[1:24]]
vehicles <- data.frame(lapply(vehicles, as.character), stringsAsFactors=FALSE)
vehicles <- data.frame(lapply(vehicles, as.numeric))
vehicles[is.na(vehicles)] <- 0
vehicles$cylinders <- ifelse(vehicles$cylinders == 6, 1,0)
```

We call `prop.table`

to understand the proportions of our outcome variable:

```
prop.table(table(vehicles$cylinders))
```

```
## 0 1
## 0.6506 0.3494
```

This tells us that 35% of the data represents a vehicle with 6 cylinders. So our data isn’t perfectly balanced but it certainly isn’t skewed or considered a rare event.

Here is the one complicated part, instead of splitting the data into 2 parts of `train`

and `test`

, we split the data into 3 parts: `ensembleData`

, `blenderData`

, and `testingData`

:

```
set.seed(1234)
vehicles <- vehicles[sample(nrow(vehicles)),]
split <- floor(nrow(vehicles)/3)
ensembleData <- vehicles[0:split,]
blenderData <- vehicles[(split+1):(split*2),]
testingData <- vehicles[(split*2+1):nrow(vehicles),]
```

We assign the outcome name to `labelName`

and the predictor variables to `predictors`

:

```
labelName <- 'cylinders'
predictors <- names(ensembleData)[names(ensembleData) != labelName]
```

We create a **caret** `trainControl`

object to control the number of cross-validations performed (the more the better but for breivity we only require 3):

```
myControl <- trainControl(method='cv', number=3, returnResamp='none')
```

**Benchmark Model**

We run the data on a `gbm`

model without any enembling to use as a comparative benchmark:

```
test_model <- train(blenderData[,predictors], blenderData[,labelName], method='gbm', trControl=myControl)
```

```
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.2147 nan 0.1000 0.0128
## 2 0.2044 nan 0.1000 0.0104
## 3 0.1962 nan 0.1000 0.0084
...
```

You’ll see a series of the lines (as shown above) as it trains the`gbm`

model.

We then use the model to predict 6-cylinder vehicles using the `testingData`

data set and **pROC**’s `auc`

function to get the Area Under the Curve (AUC):

```
preds <- predict(object=test_model, testingData[,predictors])
library(pROC)
auc <- roc(testingData[,labelName], preds)
print(auc$auc)
```

```
## Area under the curve: 0.99
```

It gives a fairly strong **AUC** score of 0.99 (remember that 0.5 is random and 1 is perfect). Hard to believe we can improve this score using an ensemble of models…

**Ensembles**

But we’re going to try. We now use 3 models - `gbm`

, `rpart`

, and `treebag`

as part of our **ensembles** of models and `train`

them with the `ensembleData`

data set:

```
model_gbm <- train(ensembleData[,predictors], ensembleData[,labelName], method='gbm', trControl=myControl)
model_rpart <- train(ensembleData[,predictors], ensembleData[,labelName], method='rpart', trControl=myControl)
model_treebag <- train(ensembleData[,predictors], ensembleData[,labelName], method='treebag', trControl=myControl)
```

After our 3 models are trained, we use them to predict **6 cylinder vehicles** on the other two data sets: `blenderData`

and `testingData`

- **yes, both!!** We need to do this to harvest the predictions from both data sets as we’re going to add those predictions as new features to the same data sets. So, as we have 3 models, we’re going to add three new columns to both `blenderData`

and `testingData`

:

```
blenderData$gbm_PROB <- predict(object=model_gbm, blenderData[,predictors])
blenderData$rf_PROB <- predict(object=model_rpart, blenderData[,predictors])
blenderData$treebag_PROB <- predict(object=model_treebag, blenderData[,predictors])
testingData$gbm_PROB <- predict(object=model_gbm, testingData[,predictors])
testingData$rf_PROB <- predict(object=model_rpart, testingData[,predictors])
testingData$treebag_PROB <- predict(object=model_treebag, testingData[,predictors])
```

Please note how easy it is to add those values back to the original data set (follow where we assign the resulting predictions above).

Now we train a final **blending** model on the old data and the new predictions (we use `gbm`

but that is completely arbitrary):

```
predictors <- names(blenderData)[names(blenderData) != labelName]
final_blender_model <- train(blenderData[,predictors], blenderData[,labelName], method='gbm', trControl=myControl)
```

And we call `predict`

and `roc`

/`auc`

functions to see how our blended ensemble model fared:

```
preds <- predict(object=final_blender_model, testingData[,predictors])
auc <- roc(testingData[,labelName], preds)
print(auc$auc)
```

```
## Area under the curve: 0.993
```

There you have it, an **AUC** bump of 0.003. This may not seem like much (and there are many ways of improving that score) but it should easily give you that extra edge on your next data science competition!!

Full source code (also on GitHub):

```
library(caret)
names(getModelInfo())
# Load data from Hadley Wickham on Github - Vehicle data set and predict 6 cylinder vehicles
library(RCurl)
#urlData <- getURL('https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv')
#vehicles <- read.csv(text = urlData)
# alternative way of getting the data
urlfile <-'https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv'
x <- getURL(urlfile, ssl.verifypeer = FALSE)
vehicles <- read.csv(textConnection(x))
# clean up the data and only use the first 24 columns
vehicles <- vehicles[names(vehicles)[1:24]]
vehicles <- data.frame(lapply(vehicles, as.character), stringsAsFactors=FALSE)
vehicles <- data.frame(lapply(vehicles, as.numeric))
vehicles[is.na(vehicles)] <- 0
vehicles$cylinders <- ifelse(vehicles$cylinders == 6, 1,0)
prop.table(table(vehicles$cylinders))
# shuffle and split the data into three parts
set.seed(1234)
vehicles <- vehicles[sample(nrow(vehicles)),]
split <- floor(nrow(vehicles)/3)
ensembleData <- vehicles[0:split,]
blenderData <- vehicles[(split+1):(split*2),]
testingData <- vehicles[(split*2+1):nrow(vehicles),]
# set label name and predictors
labelName <- 'cylinders'
predictors <- names(ensembleData)[names(ensembleData) != labelName]
library(caret)
# create a caret control object to control the number of cross-validations performed
myControl <- trainControl(method='cv', number=3, returnResamp='none')
# quick benchmark model
test_model <- train(blenderData[,predictors], blenderData[,labelName], method='gbm', trControl=myControl)
preds <- predict(object=test_model, testingData[,predictors])
library(pROC)
auc <- roc(testingData[,labelName], preds)
print(auc$auc) # Area under the curve: 0.9896
# train all the ensemble models with ensembleData
model_gbm <- train(ensembleData[,predictors], ensembleData[,labelName], method='gbm', trControl=myControl)
model_rpart <- train(ensembleData[,predictors], ensembleData[,labelName], method='rpart', trControl=myControl)
model_treebag <- train(ensembleData[,predictors], ensembleData[,labelName], method='treebag', trControl=myControl)
# get predictions for each ensemble model for two last data sets
# and add them back to themselves
blenderData$gbm_PROB <- predict(object=model_gbm, blenderData[,predictors])
blenderData$rf_PROB <- predict(object=model_rpart, blenderData[,predictors])
blenderData$treebag_PROB <- predict(object=model_treebag, blenderData[,predictors])
testingData$gbm_PROB <- predict(object=model_gbm, testingData[,predictors])
testingData$rf_PROB <- predict(object=model_rpart, testingData[,predictors])
testingData$treebag_PROB <- predict(object=model_treebag, testingData[,predictors])
# see how each individual model performed on its own
auc <- roc(testingData[,labelName], testingData$gbm_PROB )
print(auc$auc) # Area under the curve: 0.9893
auc <- roc(testingData[,labelName], testingData$rf_PROB )
print(auc$auc) # Area under the curve: 0.958
auc <- roc(testingData[,labelName], testingData$treebag_PROB )
print(auc$auc) # Area under the curve: 0.9734
# run a final model to blend all the probabilities together
predictors <- names(blenderData)[names(blenderData) != labelName]
final_blender_model <- train(blenderData[,predictors], blenderData[,labelName], method='gbm', trControl=myControl)
# See final prediction and AUC of blended ensemble
preds <- predict(object=final_blender_model, testingData[,predictors])
auc <- roc(testingData[,labelName], preds)
print(auc$auc) # Area under the curve: 0.9922
```

Manuel Amunategui - Follow me on Twitter: @amunategui