Predicting Multiple Discrete Values with Multinomials, Neural Networks and the {nnet} Package
Practical walkthroughs on machine learning, data exploration and finding insight.
Resources
Packages Used in this Walkthrough
- {nnet} - neural network multinomial modeling
- {RCurl} - downloads https data
- {caret} - dummyVars and postResample function
So, what is a multinomial model?
From Wikipedia:
Multinomial logistic regression is a simple extension of binary logistic regression that allows for more than two categories of the dependent or outcome variable.
And from the multinom {nnet} help file:
library(nnet)
?multinom
Fits multinomial log-linear models via neural networks.
In a nutshell, this allows you to predict a factor with more than two levels in one shot, with the power of neural networks. Neural networks are great at working through many combinations of variables and also handle linear relationships well, so it’s an ideal pairing.
If your data is linear in nature, then instead of using multiple models, A versus B, B versus C, and C versus A, and finally going through the hassle of concatenating the resulting probabilities, you can let nnet do it all in one shot. And that pairwise juggling becomes exponentially more difficult as you predict more than 3 outcome levels!
The multinom function will do all that for you in one shot and allows you to observe the probabilities of each outcome level to interpret things (now that’s really cool).
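Before diving into the vehicle data, here is a minimal sketch on R’s built-in iris data set (used purely for illustration, it is not part of this walkthrough) showing how multinom handles a three-level factor in a single call:
library(nnet)
# Species is a factor with three levels: setosa, versicolor, virginica
toyModel <- multinom(Species ~ ., data = iris, trace = FALSE)
# one probability column per level; each row sums to 1
head(round(fitted(toyModel), 3))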
Let’s code!
We’re going to use a Hadley Wickham data set to predict how many cylinders a vehicle has. We download the data from GitHub:
library(RCurl)
urlfile <-'https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv'
x <- getURL(urlfile, ssl.verifypeer = FALSE)
vehicles <- read.csv(textConnection(x))
For simplicity’s sake, only use the first 24 columns of the data (all of which are numeric) and impute any NAs with 0.
vehicles <- vehicles[names(vehicles)[1:24]]
vehicles[is.na(vehicles)] <- 0
names(vehicles)
## [1] "barrels08" "barrelsA08" "charge120"
## [4] "charge240" "city08" "city08U"
## [7] "cityA08" "cityA08U" "cityCD"
## [10] "cityE" "cityUF" "co2"
## [13] "co2A" "co2TailpipeAGpm" "co2TailpipeGpm"
## [16] "comb08" "comb08U" "combA08"
## [19] "combA08U" "combE" "combinedCD"
## [22] "combinedUF" "cylinders" "displ"
Use the cylinders column as the model’s outcome and cast it to a factor. Use the table function to see how many types of cylinders we are dealing with (by the way, a 0-cylinder vehicle is an electric vehicle):
vehicles$cylinders <- as.factor(vehicles$cylinders)
table(vehicles$cylinders)
##
## 0 2 3 4 5 6 8 10 12 16
## 66 51 182 13133 757 12101 7715 138 481 7
We see that the 4- and 6-cylinder vehicles are by far the most numerous.
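If you prefer shares over raw counts, prop.table makes the imbalance explicit (a quick sketch):
# proportions of each cylinder count; 4 and 6 dominate the data set
round(prop.table(table(vehicles$cylinders)), 3)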
Shuffle the data and split it into two equal data frames so we can have a training and a testing data set:
set.seed(1234)
vehicles <- vehicles[sample(nrow(vehicles)),]
split <- floor(nrow(vehicles)/2)
vehiclesTrain <- vehicles[0:split,]
vehiclesTest <- vehicles[(split+1):nrow(vehicles),]
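A note of caution: with rare classes like the 16-cylinder vehicles (only 7 rows), a plain random split can land most of them in one half. A hedged alternative sketch uses caret’s createDataPartition, which samples within each factor level to keep class proportions similar in both halves (stratTrain and stratTest are new names so the objects above stay untouched):
library(caret)
set.seed(1234)
# returns row indices sampled within each level of the outcome
trainIndex <- createDataPartition(vehicles$cylinders, p = 0.5, list = FALSE)
stratTrain <- vehicles[trainIndex,]
stratTest <- vehicles[-trainIndex,]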
Let’s put nnet to work and predict cylinders. The maxit parameter defaults to 100 when omitted, so let’s start with a large number during this first round to make sure the model can reach the lowest possible error level (i.e. the global minimum, the solution with the smallest error):
library(nnet)
cylModel <- multinom(cylinders~., data=vehiclesTrain, maxit=500, trace=T)
## # weights: 250 (216 variable)
## initial value 39869.260885
## iter 10 value 18697.133750
...
## iter 420 value 5217.401201
## final value 5217.398483
## converged
When you see the word converged in the log output, you know the model went as far as it could: it reached its lowest error before hitting the maxit cap.
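You can also query the fitted object directly; multinom models carry a residual deviance and an AIC, which is handy when comparing runs (a small sketch; the deviance should be twice the final value printed in the trace):
# residual deviance of the fitted model
cylModel$deviance
# AIC penalizes fit quality by the effective number of parameters
AIC(cylModel)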
Let’s find the most influential variables using caret’s varImp function:
library(caret)
mostImportantVariables <- varImp(cylModel)
mostImportantVariables$Variables <- row.names(mostImportantVariables)
mostImportantVariables <- mostImportantVariables[order(-mostImportantVariables$Overall),]
print(head(mostImportantVariables))
## Variables Overall
## charge240 625.5732
## cityUF 596.4079
## combinedUF 580.1112
## displ 434.8038
## cityE 395.3533
## combA08 322.2910
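For a quick visual of these scores, base R’s dotchart works fine (a small sketch, nothing specific to multinom):
top <- head(mostImportantVariables, 6)
# reverse so the largest importance sits at the top of the chart
dotchart(rev(top$Overall), labels = rev(top$Variables), xlab = "Overall importance")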
Next we predict cylinders on the testing data set using the predict function. There are two ways to compute predictions, class or probs:
preds1 <- predict(cylModel, type="probs", newdata=vehiclesTest)
head(preds1)
## 0 2 3 4 5 6
## 30632 1.966e-53 2.770e-38 3.232e-40 3.024e-02 4.728e-02 0.9222608
## 26204 4.310e-33 8.316e-112 5.884e-21 9.297e-01 3.009e-02 0.0401936
## 27378 1.211e-92 7.384e-65 5.948e-75 7.352e-09 2.823e-05 0.2919979
## 12346 2.808e-48 9.065e-29 3.301e-37 4.767e-02 9.357e-02 0.8580807
## 13664 2.428e-27 4.287e-33 8.640e-21 9.643e-01 2.190e-02 0.0137845
## 3357 1.654e-146 2.447e-119 2.731e-111 3.014e-17 2.733e-10 0.0004251
## 8 10 12 16
## 30632 2.198e-04 4.059e-09 2.837e-09 2.457e-89
## 26204 1.523e-06 6.732e-13 3.522e-16 4.859e-95
## 27378 7.019e-01 3.665e-03 2.452e-03 9.845e-46
## 12346 6.820e-04 9.299e-08 1.057e-07 3.385e-87
## 13664 7.224e-08 6.902e-14 7.690e-15 1.406e-111
## 3357 8.994e-01 2.395e-02 7.627e-02 4.517e-45
preds2 <- predict(cylModel, type="class", newdata=vehiclesTest)
head(preds2)
## [1] 6 4 8 6 4 8
## Levels: 0 2 3 4 5 6 8 10 12 16
Choosing between the two prediction types depends on your needs. If you just want the cylinders predictions, use class; if you need anything more nuanced, like measuring the conviction of each prediction, use the probs option (every row will add up to 1).
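As a sanity check, and to see that class is simply the most probable column of probs, you can derive one from the other (a small sketch):
# every row of the probability matrix sums to 1
summary(rowSums(preds1))
# picking the column with the highest probability reproduces the class predictions
derived <- colnames(preds1)[max.col(preds1)]
head(derived)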
To check the accuracy of the model, we call the postResample function from caret. For numeric vectors it computes the root mean squared error and R-squared; for factors, the overall agreement rate (accuracy) and Kappa:
postResample(vehiclesTest$cylinders,preds2)
## Accuracy Kappa
## 0.9034 0.8566
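If you want more detail than a single accuracy number, caret’s confusionMatrix breaks the results down per class (a sketch; both arguments must be factors sharing the same levels, which is the case here):
# cross-tabulates predictions against the true classes,
# with per-class sensitivity and specificity
confusionMatrix(preds2, vehiclesTest$cylinders)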
As a bonus, let’s do some simple repeated cross-validation to get a more comprehensive mean accuracy score and to understand convergence. The code below iterates through all the data so that every observation gets a chance to be in both the test and training sets. The first time around, we set maxit to only 50:
totalAccuracy <- c()
cv <- 10
cvDivider <- floor(nrow(vehicles) / (cv+1))
for (cv in seq(1:cv)) {
  # assign chunk to data test
  dataTestIndex <- c((cv * cvDivider):(cv * cvDivider + cvDivider))
  dataTest <- vehicles[dataTestIndex,]
  # everything else to train
  dataTrain <- vehicles[-dataTestIndex,]
  cylModel <- multinom(cylinders~., data=dataTrain, maxit=50, trace=T)
  pred <- predict(cylModel, newdata=dataTest, type="class")
  # classification error
  cv_ac <- postResample(dataTest$cylinders, pred)[[1]]
  print(paste('Current Accuracy:',cv_ac,'for CV:',cv))
  totalAccuracy <- c(totalAccuracy, cv_ac)
}
## stopped after 50 iterations
## [1] "Current Accuracy: 0.62559542711972 for CV: 1"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.650682756430613 for CV: 2"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.613210543029533 for CV: 3"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.657669101302001 for CV: 4"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.607494442680216 for CV: 5"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.644649094950778 for CV: 6"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.70593839314068 for CV: 7"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.621149571292474 for CV: 8"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.59892029215624 for CV: 9"
...
## stopped after 50 iterations
## [1] "Current Accuracy: 0.62432518259765 for CV: 10"
mean(totalAccuracy)
## [1] 0.635
The mean accuracy of 0.635 is much lower than the accuracy of 0.9034 that we got with the original simple split. You will notice that the log output never prints the word converged. This means the model never reaches the lowest error (the global minimum) and therefore isn’t the best fit.
Let’s try this again and let the model converge by setting maxit to a large number:
totalAccuracy <- c()
cv <- 10
cvDivider <- floor(nrow(vehicles) / (cv+1))
for (cv in seq(1:cv)) {
  # assign chunk to data test
  dataTestIndex <- c((cv * cvDivider):(cv * cvDivider + cvDivider))
  dataTest <- vehicles[dataTestIndex,]
  # everything else to train
  dataTrain <- vehicles[-dataTestIndex,]
  cylModel <- multinom(cylinders~., data=dataTrain, maxit=1000, trace=T)
  pred <- predict(cylModel, newdata=dataTest, type="class")
  # classification error
  cv_ac <- postResample(dataTest$cylinders, pred)[[1]]
  print(paste('Current Accuracy:',cv_ac,'for CV:',cv))
  totalAccuracy <- c(totalAccuracy, cv_ac)
}
## converged
## [1] "Current Accuracy: 0.905366783105748 for CV: 1"
## converged
## [1] "Current Accuracy: 0.89838043823436 for CV: 2"
## converged
## [1] "Current Accuracy: 0.909812638932995 for CV: 3"
## converged
## [1] "Current Accuracy: 0.904096538583677 for CV: 4"
## converged
## [1] "Current Accuracy: 0.906637027627818 for CV: 5"
## converged
## [1] "Current Accuracy: 0.906001905366783 for CV: 6"
## converged
## [1] "Current Accuracy: 0.911400444585583 for CV: 7"
## converged
## [1] "Current Accuracy: 0.902508732931089 for CV: 8"
## converged
## [1] "Current Accuracy: 0.90377897745316 for CV: 9"
## converged
## [1] "Current Accuracy: 0.904414099714195 for CV: 10"
mean(totalAccuracy)
## [1] 0.9052
The score of 0.9052 from the repeated cross-validation code is better than the 0.9034 from the original simple split, and we let each loop converge. The point of repeated cross-validation isn’t that it will return a higher accuracy score (it doesn’t always) but that it gives you a much more reliable estimate, as it uses all of your data.
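As an alternative to rolling your own loop, caret’s train function can run repeated cross-validation for you. A hedged sketch (method = "multinom" wraps nnet’s multinom; decay is its only tuning parameter and is fixed at 0 here so nothing is tuned; expect this to run for a while):
library(caret)
set.seed(1234)
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
cvModel <- train(cylinders ~ ., data = vehicles, method = "multinom",
                 tuneGrid = data.frame(decay = 0),
                 maxit = 1000, trace = FALSE, trControl = fitControl)
# mean accuracy and Kappa across all resamples
cvModel$results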
Full source code (also on GitHub):
library(nnet)
?multinom
library(caret)
library(RCurl)
library(Metrics)
#####################################################################
# Load data from Hadley Wickham on GitHub
urlfile <-'https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv'
x <- getURL(urlfile, ssl.verifypeer = FALSE)
vehicles <- read.csv(textConnection(x))
# clean up the data and only use the first 24 columns
vehicles <- vehicles[names(vehicles)[1:24]]
vehicles[is.na(vehicles)] <- 0
# use the cylinders column as outcome and cast it to a factor
vehicles$cylinders <- as.factor(vehicles$cylinders)
table(vehicles$cylinders)
#####################################################################
# shuffle and split
set.seed(1234)
vehicles <- vehicles[sample(nrow(vehicles)),]
split <- floor(nrow(vehicles)/2)
vehiclesTrain <- vehicles[0:split,]
vehiclesTest <- vehicles[(split+1):nrow(vehicles),]
# see how multinom predicts cylinders
# set the maxit to a large number, enough so the neural net can converge to smallest error
cylModel <- multinom(cylinders~., data=vehiclesTrain, maxit=500, trace=T)
# Sort by most influential variables
topModels <- varImp(cylModel)
topModels$Variables <- row.names(topModels)
topModels <- topModels[order(-topModels$Overall),]
# two prediction types: 'class' returns the predicted label, 'probs' a probability per level
preds1 <- predict(cylModel, type="probs", newdata=vehiclesTest)
preds2 <- predict(cylModel, type="class", newdata=vehiclesTest)
# resample for accuracy - for numeric vectors postResample computes RMSE and R-squared; for factors, the overall agreement rate and Kappa
postResample(vehiclesTest$cylinders,preds2)[[1]]
classificationError <- ce(as.numeric(vehiclesTest$cylinders), as.numeric(preds2))
# repeated cross-validation: iterate through the data so every observation gets a chance to be in the test and train portions
totalError <- c()
cv <- 10
maxiterations <- 500 # try it again with a lower value and notice the mean error
cvDivider <- floor(nrow(vehiclesTrain) / (cv+1))
for (cv in seq(1:cv)) {
  # assign chunk to data test
  dataTestIndex <- c((cv * cvDivider):(cv * cvDivider + cvDivider))
  dataTest <- vehiclesTrain[dataTestIndex,]
  # everything else to train
  dataTrain <- vehiclesTrain[-dataTestIndex,]
  cylModel <- multinom(cylinders~., data=dataTrain, maxit=maxiterations, trace=T)
  pred <- predict(cylModel, dataTest)
  # classification error
  err <- ce(as.numeric(dataTest$cylinders), as.numeric(pred))
  totalError <- c(totalError, err)
}
print(paste('Mean error of all repeated cross validations:',mean(totalError)))
Manuel Amunategui - Follow me on Twitter: @amunategui