# Predicting Multiple Discrete Values with Multinomials, Neural Networks and the {nnet} Package

Practical walkthroughs on machine learning, data exploration and finding insight.

**Resources**

**Packages Used in this Walkthrough**

**{nnet}**- neural network multinomial modeling**{RCurl}**- downloads https data**{caret}**- dummyVars and postResample function

So, what is a **multionmial model**?

From Wikipedia:

Multinomial logistic regression is a simple extension of binary logistic regression that allows for more than two categories of the dependent or outcome variable.

And from the `multinom`

**{nnet}** help file:

```
library(nnet)
?multinom
```

Fits multinomial log-linear models via neural networks.

In a nutshell, this allows you to predict a factor of multiple levels (more than two) in one shot with the power of neural networks. **Neural networks** are great at working through multiple combinations and also great with linear models, so it’s an ideal combination.

If your data is linear in nature, then instead of using multiple models and doing `A`

versus `B`

, `B`

versus `C`

, and `C`

versus `A`

, and finally going through the hassle of concatenating the resulting probabilities, you can let **nnet** do it all in one shot. And this becomes exponentialy more difficult as you predict more than 3 outcome levels!!

The `multinom`

function will do all that for you in one shot and allow you to observe the probabilities of each subset to interpret things (now that’s really cool).

**Let’s code!**

We’re going to use a Hadley Wickham data set to predict how many cylinders a vehicle has. We download the data from Github:

```
library(RCurl)
urlfile <-'https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv'
x <- getURL(urlfile, ssl.verifypeer = FALSE)
vehicles <- read.csv(textConnection(x))
```

Only use the first 24 columns of the data for simplicities sake. Cast all variables to numerics and impute any NAs with `0`

.

```
vehicles <- vehicles[names(vehicles)[1:24]]
vehicles[is.na(vehicles)] <- 0
names(vehicles)
```

```
## [1] "barrels08" "barrelsA08" "charge120"
## [4] "charge240" "city08" "city08U"
## [7] "cityA08" "cityA08U" "cityCD"
## [10] "cityE" "cityUF" "co2"
## [13] "co2A" "co2TailpipeAGpm" "co2TailpipeGpm"
## [16] "comb08" "comb08U" "combA08"
## [19] "combA08U" "combE" "combinedCD"
## [22] "combinedUF" "cylinders" "displ"
```

Use the `cyclinder`

column as the model’s outcome and cast it to a factor. Use the `table`

function to see how many types of cylinders we are dealing with (BTW a `0`

cylinder vehicle is an electric vehicle):

```
vehicles$cylinders <- as.factor(vehicles$cylinders)
table(vehicles$cylinders)
```

```
##
## 0 2 3 4 5 6 8 10 12 16
## 66 51 182 13133 757 12101 7715 138 481 7
```

We see that the 4 and 6 cylinder vehicles are the most numerous.

Shuffle the data and split it into two equal data frames so we can have a training and a testing data set:

```
set.seed(1234)
vehicles <- vehicles[sample(nrow(vehicles)),]
split <- floor(nrow(vehicles)/2)
vehiclesTrain <- vehicles[0:split,]
vehiclesTest <- vehicles[(split+1):nrow(vehicles),]
```

Let’s put **nnet** to work and predict cyclinders. The `maxiter`

variable defaults to 100 when omitted so let’s start with a large number during the first round to make sure we find the lowest possible error level (i.e. global minimum - solution with the lowest error possible):

```
library(nnet)
cylModel <- multinom(cylinders~., data=vehiclesTrain, maxit=500, trace=T)
```

```
## # weights: 250 (216 variable)
## initial value 39869.260885
## iter 10 value 18697.133750
...
## iter 420 value 5217.401201
## final value 5217.398483
## converged
```

When you see the word **converged** in the log output, you know the model went as far as it could.

Let’s find the most influential variables by using **caret’s** `varImp`

function:

```
library(caret)
mostImportantVariables <- varImp(cylModel)
mostImportantVariables$Variables <- row.names(mostImportantVariables)
mostImportantVariables <- mostImportantVariables[order(-mostImportantVariables$Overall),]
print(head(mostImportantVariables))
```

```
## Variables Overall
## charge240 625.5732
## cityUF 596.4079
## combinedUF 580.1112
## displ 434.8038
## cityE 395.3533
## combA08 322.2910
```

Next we predict `cylinders`

using the `predict`

function on the testing data set. There are two ways to compute predictions, `class`

or `probs`

:

```
preds1 <- predict(cylModel, type="probs", newdata=vehiclesTest)
head(preds1)
```

```
## 0 2 3 4 5 6
## 30632 1.966e-53 2.770e-38 3.232e-40 3.024e-02 4.728e-02 0.9222608
## 26204 4.310e-33 8.316e-112 5.884e-21 9.297e-01 3.009e-02 0.0401936
## 27378 1.211e-92 7.384e-65 5.948e-75 7.352e-09 2.823e-05 0.2919979
## 12346 2.808e-48 9.065e-29 3.301e-37 4.767e-02 9.357e-02 0.8580807
## 13664 2.428e-27 4.287e-33 8.640e-21 9.643e-01 2.190e-02 0.0137845
## 3357 1.654e-146 2.447e-119 2.731e-111 3.014e-17 2.733e-10 0.0004251
## 8 10 12 16
## 30632 2.198e-04 4.059e-09 2.837e-09 2.457e-89
## 26204 1.523e-06 6.732e-13 3.522e-16 4.859e-95
## 27378 7.019e-01 3.665e-03 2.452e-03 9.845e-46
## 12346 6.820e-04 9.299e-08 1.057e-07 3.385e-87
## 13664 7.224e-08 6.902e-14 7.690e-15 1.406e-111
## 3357 8.994e-01 2.395e-02 7.627e-02 4.517e-45
```

```
preds2 <- predict(cylModel, type="class", newdata=vehiclesTest)
head(preds2)
```

```
## [1] 6 4 8 6 4 8
## Levels: 0 2 3 4 5 6 8 10 12 16
```

Choosing which of the two predictions will depend on your needs. If you just want your `cylinders`

predictions, use `class`

, if you need to do anything more complex, like measure the conviction of each prediction, use the `probs`

option (every row will add up to 1).

To check the **accuracy** of the model, we call the `postResample`

function from **caret**. For numeric vectors, it uses the mean squared error and R-squared and for factors, the overall agreement rate and Kappa:

```
postResample(vehiclesTest$cylinders,preds2)
```

```
## Accuracy Kappa
## 0.9034 0.8566
```

As a bonus, lets do some simple repeated cross validation to get a more comprehensive mean accuracy score and understand convergence. The code below will iterate through all the data to give every variable a chance of being **test** and **train** data sets. The first time around we set `maxit`

to only **50**:

```
totalAccuracy <- c()
cv <- 10
cvDivider <- floor(nrow(vehicles) / (cv+1))
for (cv in seq(1:cv)) {
# assign chunk to data test
dataTestIndex <- c((cv * cvDivider):(cv * cvDivider + cvDivider))
dataTest <- vehicles[dataTestIndex,]
# everything else to train
dataTrain <- vehicles[-dataTestIndex,]
cylModel <- multinom(cylinders~., data=dataTrain, maxit=50, trace=T)
pred <- predict(cylModel, newdata=dataTest, type="class")
# classification error
cv_ac <- postResample(dataTest$cylinders, pred)[[1]]
print(paste('Current Accuracy:',cv_ac,'for CV:',cv))
totalAccuracy <- c(totalAccuracy, cv_ac)
}
```

```
## stopped after 50 iterations
## [1] "Current Accuracy: 0.62559542711972 for CV: 1"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.650682756430613 for CV: 2"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.613210543029533 for CV: 3"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.657669101302001 for CV: 4"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.607494442680216 for CV: 5"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.644649094950778 for CV: 6"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.70593839314068 for CV: 7"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.621149571292474 for CV: 8"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.59892029215624 for CV: 9"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.62432518259765 for CV: 10"
```

```
mean(totalAccuracy)
```

```
## [1] 0.635
```

The **mean accuracy** of **0.635** is much lower than the accuracy of **0.9034** that we got with the original simple split. You will notice that the log output never prints the word **converged**. This means the model never reaches the lowest error or global minima and therefore isn’t the best fit.

Let’s try this again and let the model converge by setting the `maxit`

to a large number

```
totalAccuracy <- c()
cv <- 10
cvDivider <- floor(nrow(vehicles) / (cv+1))
for (cv in seq(1:cv)) {
# assign chunk to data test
dataTestIndex <- c((cv * cvDivider):(cv * cvDivider + cvDivider))
dataTest <- vehicles[dataTestIndex,]
# everything else to train
dataTrain <- vehicles[-dataTestIndex,]
cylModel <- multinom(cylinders~., data=dataTrain, maxit=1000, trace=T)
pred <- predict(cylModel, newdata=dataTest, type="class")
# classification error
cv_ac <- postResample(dataTest$cylinders, pred)[[1]]
print(paste('Current Accuracy:',cv_ac,'for CV:',cv))
totalAccuracy <- c(totalAccuracy, cv_ac)
}
```

```
## converged
## [1] "Current Accuracy: 0.905366783105748 for CV: 1"
## converged
## [1] "Current Accuracy: 0.89838043823436 for CV: 2"
## converged
## [1] "Current Accuracy: 0.909812638932995 for CV: 3"
## converged
## [1] "Current Accuracy: 0.904096538583677 for CV: 4"
## converged
## [1] "Current Accuracy: 0.906637027627818 for CV: 5"
## converged
## [1] "Current Accuracy: 0.906001905366783 for CV: 6"
## converged
## [1] "Current Accuracy: 0.911400444585583 for CV: 7"
## converged
## [1] "Current Accuracy: 0.902508732931089 for CV: 8"
## converged
## [1] "Current Accuracy: 0.90377897745316 for CV: 9"
## converged
## [1] "Current Accuracy: 0.904414099714195 for CV: 10"
```

```
mean(totalAccuracy)
## [1] 0.9052
```

The score using the **repeated cross validation** code is better than the original simple split of **0.9304** and we let each loop converge. The point of using the **repeated cross validation** code isn’t that it will return a higher accuracy score (and it doesn’t always) but that it will give you a much more accurate score as it uses all of your data.

Full source code (also on GitHub):

```
library(nnet)
?multinom
library(caret)
library(RCurl)
library(Metrics)
#####################################################################
# Load data from Hadley Wickham on Github
urlfile <-'https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv'
x <- getURL(urlfile, ssl.verifypeer = FALSE)
vehicles <- read.csv(textConnection(x))
# clean up the data and only use the first 24 columns
vehicles <- vehicles[names(vehicles)[1:24]]
vehicles[is.na(vehicles)] <- 0
# use cyclinder column as outcome and cast to factor
vehicles$cylinders <- as.factor(vehicles$cylinders)
table(vehicles$cylinders)
#####################################################################
# shuffle and split
set.seed(1234)
vehicles <- vehicles[sample(nrow(vehicles)),]
split <- floor(nrow(vehicles)/2)
vehiclesTrain <- vehicles[0:split,]
vehiclesTest <- vehicles[(split+1):nrow(vehicles),]
# see how multinom predicts cylinders
# set the maxit to a large number, enough so the neural net can converge to smallest error
cylModel <- multinom(cylinders~., data=vehiclesTrain, maxit=500, trace=T)
# Sort by most influential variables
topModels <- varImp(cylModel)
topModels$Variables <- row.names(topModels)
topModels <- topModels[order(-topModels$Overall),]
# class/probs (best option, second best option?)
preds1 <- predict(cylModel, type="probs", newdata=vehiclesTest)
preds2 <- predict(cylModel, type="class", newdata=vehiclesTest)
# resample for accuracy - the mean squared error and R-squared are calculated of forfactors, the overall agreement rate and Kappa
postResample(vehiclesTest$cylinders,preds2)[[1]]
library(Metrics)
classificationError <- ce(as.numeric(vehiclesTest$cylinders), as.numeric(preds2))
# repeat cross validate by iterating through all the data to give every variable a chance of being test and train portions
totalError <- c()
cv <- 10
maxiterations <- 500 # try it again with a lower value and notice the mean error
cvDivider <- floor(nrow(vehiclesTrain) / (cv+1))
for (cv in seq(1:cv)) {
# assign chunk to data test
dataTestIndex <- c((cv * cvDivider):(cv * cvDivider + cvDivider))
dataTest <- vehiclesTrain[dataTestIndex,]
# everything else to train
dataTrain <- vehiclesTrain[-dataTestIndex,]
cylModel <- multinom(cylinders~., data=dataTrain, maxit=maxiterations, trace=T)
pred <- predict(cylModel, dataTest)
# classification error
err <- ce(as.numeric(dataTest$cylinders), as.numeric(pred))
totalError <- c(totalError, err)
}
print(paste('Mean error of all repeated cross validations:',mean(totalError)))
```