The Sparse Matrix and {glmnet}
Practical walkthroughs on machine learning, data exploration and finding insight.
Resources
Packages Used in this Walkthrough
- {Matrix} - creates sparse/dense matrices
- {glmnet} - generalized linear models
- {pROC} - ROC tools
In this walkthrough, I am going to show how sparse matrices work in R and how to use them with the {glmnet} package.
For those who aren’t familiar with it, a sparse matrix, as the name implies, is a large but mostly empty data set. If your data contains lots of zeros, then a sparse matrix is a very memory-efficient way of holding that data. For example, if you have a lot of dummy variables, most of that data will be zeros, and a sparse matrix will only store the non-zero values and skip the zeros, thus using a lot less memory and allowing you to work with much larger data sets than traditional data frames would.
Wikipedia has a great write-up on sparse matrices and related theory if you want to dive into this in more detail.
Unfortunately, the sparse matrix in R doesn’t accept NAs, NaNs, or infinite values. Also, normalization functions, such as centering or scaling, would alter the zero values, turning the data set into a dense matrix and defeating any memory-efficiency advantages.
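To make the memory savings concrete, here is a minimal sketch (the matrix size and non-zero count here are made up purely for illustration, and exact sizes will vary by R version):

library(Matrix)
set.seed(1)
# dense storage keeps every cell, zeros included (8 bytes each)
dense <- matrix(0, nrow = 1000, ncol = 1000)
dense[sample(length(dense), 500)] <- 1  # only 500 non-zero entries
# sparse storage keeps just the non-zero values and their positions
sparse <- Matrix(dense, sparse = TRUE)
print(object.size(dense), units = "Mb")   # roughly 7.6 Mb
print(object.size(sparse), units = "Kb")  # only a few Kb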
Let’s start with a simple data set called some_dataframe:
print(some_dataframe)
## c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 outcome
## 1 2 7 0 0 0 0 0 0 0 0 0
## 2 0 0 3 0 0 0 0 0 0 0 0
## 3 0 0 0 6 1 0 0 0 0 0 0
## 4 0 0 0 2 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 12 0 1
## 6 0 0 0 0 0 25 0 0 0 0 1
## 7 1 0 0 0 2 0 0 0 0 0 0
## 8 0 0 0 2 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 14 0 1
## 10 0 0 0 0 0 21 0 0 0 0 1
## 11 0 0 0 0 0 0 28 0 0 0 1
## 12 0 0 0 0 0 0 0 35 0 0 1
## 13 0 0 0 0 0 0 0 0 42 0 1
## 14 0 0 0 0 0 0 0 0 0 49 1
To see the above data.frame in a matrix format simply requires casting it to a matrix:
some_matrix <- data.matrix(some_dataframe[1:10])
print(some_matrix)
## c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
## [1,] 2 7 0 0 0 0 0 0 0 0
## [2,] 0 0 3 0 0 0 0 0 0 0
## [3,] 0 0 0 6 1 0 0 0 0 0
## [4,] 0 0 0 2 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0 12 0
## [6,] 0 0 0 0 0 25 0 0 0 0
## [7,] 1 0 0 0 2 0 0 0 0 0
## [8,] 0 0 0 2 0 0 0 0 0 0
## [9,] 0 0 0 0 0 0 0 0 14 0
## [10,] 0 0 0 0 0 21 0 0 0 0
## [11,] 0 0 0 0 0 0 28 0 0 0
## [12,] 0 0 0 0 0 0 0 35 0 0
## [13,] 0 0 0 0 0 0 0 0 42 0
## [14,] 0 0 0 0 0 0 0 0 0 49
Visually, it isn’t much different from the data frame (though, internally, a matrix is restricted to a single data type). To transform it into a sparse matrix, we load the {Matrix} library and call the Matrix function with the sparse flag set to TRUE:
library(Matrix)
print(Matrix(some_matrix, sparse=TRUE))
## 14 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'c1', 'c2', 'c3' ... ]]
##
## [1,] 2 7 . . . . . . . .
## [2,] . . 3 . . . . . . .
## [3,] . . . 6 1 . . . . .
## [4,] . . . 2 . . . . . .
## [5,] . . . . . . . . 12 .
## [6,] . . . . . 25 . . . .
## [7,] 1 . . . 2 . . . . .
## [8,] . . . 2 . . . . . .
## [9,] . . . . . . . . 14 .
## [10,] . . . . . 21 . . . .
## [11,] . . . . . . 28 . . .
## [12,] . . . . . . . 35 . .
## [13,] . . . . . . . . 42 .
## [14,] . . . . . . . . . 49
And here we finally get a sense of its efficiency: it only retains the non-zero values!
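Under the hood, the dgCMatrix class stores the data in compressed sparse column form. A quick peek at its slots (a small aside, not part of the original walkthrough) confirms that only the non-zero entries and their positions are kept:

sm <- Matrix(some_matrix, sparse = TRUE)
str(sm@x)  # the non-zero values themselves
str(sm@i)  # zero-based row indices of those values
str(sm@p)  # pointers marking where each column starts in @x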
The {glmnet} package
The help files describe the {glmnet} package as containing ‘extremely efficient procedures for fitting lasso or elastic-net regularization for linear regression, logistic and multinomial regression models, poisson regression and the Cox model’, and more.
Unfortunately, few models in R besides glmnet support sparse matrices (that I know of); therefore, in conversations about modeling with R, when the subject of sparse matrices comes up, it is usually followed by the glmnet model.
Let’s start by splitting our some_dataframe in two parts: a 70% portion that will become our training data set and a 30% portion for our testing data set (always set the seed so random draws are reproducible):
set.seed(2)
split <- sample(nrow(some_dataframe), floor(0.7*nrow(some_dataframe)))
train <- some_dataframe[split,]
test <- some_dataframe[-split,]
We then construct a sparse model matrix using the typical R formula:
library(glmnet)
train_sparse <- sparse.model.matrix(~.,train[1:10])
test_sparse <- sparse.model.matrix(~.,test[1:10])
print(train_sparse)
## 9 x 11 sparse Matrix of class "dgCMatrix"
## [[ suppressing 11 column names '(Intercept)', 'c1', 'c2' ... ]]
##
## 3 1 . . . 6 1 . . . . .
## 10 1 . . . . . 21 . . . .
## 7 1 1 . . . 2 . . . . .
## 2 1 . . 3 . . . . . . .
## 13 1 . . . . . . . . 42 .
## 9 1 . . . . . . . . 14 .
## 11 1 . . . . . . 28 . . .
## 6 1 . . . . . 25 . . . .
## 14 1 . . . . . . . . . 49
print(test_sparse)
## 5 x 11 sparse Matrix of class "dgCMatrix"
## [[ suppressing 11 column names '(Intercept)', 'c1', 'c2' ... ]]
##
## 1 1 2 7 . . . . . . . .
## 4 1 . . . 2 . . . . . .
## 5 1 . . . . . . . . 12 .
## 8 1 . . . 2 . . . . . .
## 12 1 . . . . . . . 35 . .
We call the glmnet model and get our fit:
fit <- glmnet(train_sparse,train[,11])
And call the predict function to get our predictions:
pred <- predict(fit, test_sparse, type="response")
print(head(pred[,1:5]))
## s0 s1 s2 s3 s4
## 1 0.6667 0.6742 0.6815 0.6884 0.695
## 4 0.6667 0.6742 0.6815 0.6884 0.695
## 5 0.6667 0.6742 0.6815 0.6884 0.695
## 8 0.6667 0.6742 0.6815 0.6884 0.695
## 12 0.6667 0.6742 0.6815 0.6884 0.695
The reason it returns many prediction sets is that glmnet fits the model for many regularization parameters at the same time. To help us choose the best prediction set, we can use the function cv.glmnet. This uses cross-validation to find the fit with the smallest error. Let’s call cv.glmnet and pass its result to the s parameter (the penalty parameter) of the predict function:
# use cv.glmnet to find best lambda/penalty - choosing small nfolds for cv due to…
# s is the penalty parameter
cv <- cv.glmnet(train_sparse,train[,11],nfolds=3)
pred <- predict(fit, test_sparse,type="response", s=cv$lambda.min)
NOTE: cv.glmnet returns various values that may be important for your modeling needs, in particular lambda.min and lambda.1se. The first is the penalty giving the smallest cross-validated error; the second is the largest penalty whose error is still within one standard error of that minimum (i.e., the simplest model that performs comparably). Refer to the help files to find the best one for your needs.
print(names(cv))
## [1] "lambda" "cvm" "cvsd" "cvup" "cvlo"
## [6] "nzero" "name" "glmnet.fit" "lambda.min" "lambda.1se"
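For instance, here is a small sketch (reusing the fit and cv objects from above; pred_1se is just an illustrative name) that inspects both penalties and predicts with the more conservative one:

print(cv$lambda.min)  # penalty with the lowest mean cross-validated error
print(cv$lambda.1se)  # largest penalty within one standard error of that minimum
pred_1se <- predict(fit, test_sparse, type="response", s=cv$lambda.1se)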
As the data is made up, let’s not read too deeply into these predictions:
print(pred)
## 1
## 1 0.9898
## 4 0.8306
## 5 0.9898
## 8 0.8306
## 12 0.9898
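One aside before moving on (this variant isn’t in the original code): since the outcome here is binary, you could also fit an explicit logistic model with family="binomial", in which case type="response" returns genuine class probabilities rather than Gaussian fitted values. A minimal sketch, with fit_logit and prob as illustrative names:

# hedged variant: an explicit logistic fit, so predictions are
# true class probabilities rather than Gaussian fitted values
fit_logit <- glmnet(train_sparse, train[,11], family="binomial")
prob <- predict(fit_logit, test_sparse, type="response")
print(head(prob[,1:5]))  # one column per penalty, values now in (0,1)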
Let’s see how the sparse.model.matrix function of the {Matrix} package handles discrete values. I added a factor variable mood with two levels: happy and sad:
print(cat_dataframe)
## c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 mood outcome
## 1 2 7 0 0 0 0 0 0 0 0 happy 0
## 2 0 0 3 0 0 0 0 0 0 0 happy 0
## 3 0 0 0 6 1 0 0 0 0 0 happy 0
## 4 0 0 0 2 0 0 0 0 0 0 happy 0
## 5 0 0 0 0 0 0 0 0 12 0 sad 1
## 6 0 0 0 0 0 25 0 0 0 0 sad 1
## 7 1 0 0 0 2 0 0 0 0 0 happy 0
## 8 0 0 0 2 0 0 0 0 0 0 happy 0
## 9 0 0 0 0 0 0 0 0 14 0 sad 1
## 10 0 0 0 0 0 21 0 0 0 0 sad 1
## 11 0 0 0 0 0 0 28 0 0 0 sad 1
## 12 0 0 0 0 0 0 0 35 0 0 sad 1
## 13 0 0 0 0 0 0 0 0 42 0 sad 1
## 14 0 0 0 0 0 0 0 0 0 49 sad 1
We call the sparse.model.matrix function and notice that it turned happy into 0’s and sad into 1’s (happy being the reference level). Since a sparse matrix only displays non-zero entries, we only see the 1’s:
sparse.model.matrix(~.,cat_dataframe)
## 14 x 13 sparse Matrix of class "dgCMatrix"
## [[ suppressing 13 column names '(Intercept)', 'c1', 'c2' ... ]]
##
## 1 1 2 7 . . . . . . . . . .
## 2 1 . . 3 . . . . . . . . .
## 3 1 . . . 6 1 . . . . . . .
## 4 1 . . . 2 . . . . . . . .
## 5 1 . . . . . . . . 12 . 1 1
## 6 1 . . . . . 25 . . . . 1 1
## 7 1 1 . . . 2 . . . . . . .
## 8 1 . . . 2 . . . . . . . .
## 9 1 . . . . . . . . 14 . 1 1
## 10 1 . . . . . 21 . . . . 1 1
## 11 1 . . . . . . 28 . . . 1 1
## 12 1 . . . . . . . 35 . . 1 1
## 13 1 . . . . . . . . 42 . 1 1
## 14 1 . . . . . . . . . 49 1 1
Let’s complicate things by adding a few more levels to our factor:
print(cat_dataframe)
## c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 mood outcome
## 1 2 7 0 0 0 0 0 0 0 0 angry 0
## 2 0 0 3 0 0 0 0 0 0 0 neutral 0
## 3 0 0 0 6 1 0 0 0 0 0 happy 0
## 4 0 0 0 2 0 0 0 0 0 0 happy 0
## 5 0 0 0 0 0 0 0 0 12 0 sad 1
## 6 0 0 0 0 0 25 0 0 0 0 sad 1
## 7 1 0 0 0 2 0 0 0 0 0 happy 0
## 8 0 0 0 2 0 0 0 0 0 0 happy 0
## 9 0 0 0 0 0 0 0 0 14 0 sad 1
## 10 0 0 0 0 0 21 0 0 0 0 neutral 1
## 11 0 0 0 0 0 0 28 0 0 0 sad 1
## 12 0 0 0 0 0 0 0 35 0 0 sad 1
## 13 0 0 0 0 0 0 0 0 42 0 sad 1
## 14 0 0 0 0 0 0 0 0 0 49 sad 1
print(levels(cat_dataframe$mood))
## [1] "angry" "happy" "neutral" "sad"
We can compare the dimensions of both data sets:
dim(cat_dataframe)
## [1] 14 12
dim(sparse.model.matrix(~.,cat_dataframe))
## [1] 14 15
The sparse model matrix has 3 extra columns: it broke the 4 levels out into 3 indicator columns. This is because it applied full-rank coding to the set - you’re either one of those 3 moods, and if you’re none of the 3, then you’re assumed to be the 4th, angry (the reference level). Check out my walkthrough on dummy variables for more details:
print(sparse.model.matrix(~.,cat_dataframe))
## 14 x 15 sparse Matrix of class "dgCMatrix"
## [[ suppressing 15 column names '(Intercept)', 'c1', 'c2' ... ]]
##
## 1 1 2 7 . . . . . . . . . . . .
## 2 1 . . 3 . . . . . . . . 1 . .
## 3 1 . . . 6 1 . . . . . 1 . . .
## 4 1 . . . 2 . . . . . . 1 . . .
## 5 1 . . . . . . . . 12 . . . 1 1
## 6 1 . . . . . 25 . . . . . . 1 1
## 7 1 1 . . . 2 . . . . . 1 . . .
## 8 1 . . . 2 . . . . . . 1 . . .
## 9 1 . . . . . . . . 14 . . . 1 1
## 10 1 . . . . . 21 . . . . . 1 . 1
## 11 1 . . . . . . 28 . . . . . 1 1
## 12 1 . . . . . . . 35 . . . . 1 1
## 13 1 . . . . . . . . 42 . . . 1 1
## 14 1 . . . . . . . . . 49 . . 1 1
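If you would rather keep one indicator column per level (one-hot encoding instead of full-rank coding), sparse.model.matrix accepts a contrasts.arg override; a hedged sketch, with m_onehot as an illustrative name:

# override the default treatment contrasts for mood so that all four
# levels get their own indicator column (no reference level dropped)
m_onehot <- sparse.model.matrix(~., cat_dataframe,
    contrasts.arg = list(mood = contrasts(cat_dataframe$mood, contrasts = FALSE)))
dim(m_onehot)  # 14 x 16: intercept, c1-c10, four mood columns, outcome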
Full source code (also on GitHub):
some_dataframe <- read.table(text="c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 outcome
2 7 0 0 0 0 0 0 0 0 0
0 0 3 0 0 0 0 0 0 0 0
0 0 0 6 1 0 0 0 0 0 0
0 0 0 2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 12 0 1
0 0 0 0 0 25 0 0 0 0 1
1 0 0 0 2 0 0 0 0 0 0
0 0 0 2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 14 0 1
0 0 0 0 0 21 0 0 0 0 1
0 0 0 0 0 0 28 0 0 0 1
0 0 0 0 0 0 0 35 0 0 1
0 0 0 0 0 0 0 0 42 0 1
0 0 0 0 0 0 0 0 0 49 1", header=T, sep="")
library(Matrix)
some_matrix <- data.matrix(some_dataframe[1:10])
# show matrix representation of data set
Matrix(some_matrix, sparse=TRUE)
# split data set into a train and test portion
set.seed(2)
split <- sample(nrow(some_dataframe), floor(0.7*nrow(some_dataframe)))
train <- some_dataframe[split,]
test <- some_dataframe[-split,]
# transform both sets into sparse matrices using sparse.model.matrix
train_sparse <- sparse.model.matrix(~.,train[1:10])
test_sparse <- sparse.model.matrix(~.,test[1:10])
# model the sparse sets using glmnet
library(glmnet)
fit <- glmnet(train_sparse,train[,11])
# use cv.glmnet to find best lambda/penalty
# s is the penalty parameter
cv <- cv.glmnet(train_sparse,train[,11],nfolds=3)
pred <- predict(fit, test_sparse,type="response", s=cv$lambda.min)
# receiver operating characteristic (ROC curves)
library(pROC)
auc <- roc(test[,11], as.numeric(pred))  # roc expects a numeric vector predictor
print(auc$auc)
# how does sparse deal with categorical data (adding mood feature with two levels)?
cat_dataframe<- read.table(text="c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 mood outcome
2 7 0 0 0 0 0 0 0 0 happy 0
0 0 3 0 0 0 0 0 0 0 happy 0
0 0 0 6 1 0 0 0 0 0 happy 0
0 0 0 2 0 0 0 0 0 0 happy 0
0 0 0 0 0 0 0 0 12 0 sad 1
0 0 0 0 0 25 0 0 0 0 sad 1
1 0 0 0 2 0 0 0 0 0 happy 0
0 0 0 2 0 0 0 0 0 0 happy 0
0 0 0 0 0 0 0 0 14 0 sad 1
0 0 0 0 0 21 0 0 0 0 sad 1
0 0 0 0 0 0 28 0 0 0 sad 1
0 0 0 0 0 0 0 35 0 0 sad 1
0 0 0 0 0 0 0 0 42 0 sad 1
0 0 0 0 0 0 0 0 0 49 sad 1", header=T, sep="")
print(sparse.model.matrix(~.,cat_dataframe))
# increasing the number of levels in the mood variable
cat_dataframe <- read.table(text="c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 mood outcome
2 7 0 0 0 0 0 0 0 0 angry 0
0 0 3 0 0 0 0 0 0 0 neutral 0
0 0 0 6 1 0 0 0 0 0 happy 0
0 0 0 2 0 0 0 0 0 0 happy 0
0 0 0 0 0 0 0 0 12 0 sad 1
0 0 0 0 0 25 0 0 0 0 sad 1
1 0 0 0 2 0 0 0 0 0 happy 0
0 0 0 2 0 0 0 0 0 0 happy 0
0 0 0 0 0 0 0 0 14 0 sad 1
0 0 0 0 0 21 0 0 0 0 neutral 1
0 0 0 0 0 0 28 0 0 0 sad 1
0 0 0 0 0 0 0 35 0 0 sad 1
0 0 0 0 0 0 0 0 42 0 sad 1
0 0 0 0 0 0 0 0 0 49 sad 1", header=T, sep="")
print(levels(cat_dataframe$mood))
dim(cat_dataframe)
# sparse.model.matrix added extra columns when it binarized mood
dim(sparse.model.matrix(~.,cat_dataframe))
Manuel Amunategui - Follow me on Twitter: @amunategui