The Sparse Matrix and {glmnet}
Practical walkthroughs on machine learning, data exploration and finding insight.
Resources
Packages Used in this Walkthrough
- {Matrix} - creates sparse/dense matrices
- {glmnet} - generalized linear models
- {pROC} - ROC tools
In this walkthrough, I am going to show how sparse matrices work in R and how to use them with the {glmnet} package.
For those who aren’t familiar with it, a sparse matrix, as the name implies, is a large but mostly empty data set. If your data contains lots of zeros, then a sparse matrix is a very memory-efficient way of holding that data. For example, if you have a lot of dummy variables, most of that data will be zeros, and a sparse matrix will only store the non-zero values and skip the zeros, thus using a lot less memory and allowing you to work with much larger data sets than traditional data frames would.
Wikipedia has a great write-up on sparse matrices and related theory if you want to dive into this in more detail.
Unfortunately, the sparse matrix in R doesn’t accept NAs, NaNs, or infinite values. Also, normalization functions, such as centering or scaling, would alter the zero values, turning the data set into a dense matrix and defeating any memory-efficiency advantages.
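To make the memory savings concrete, here is a minimal sketch (the matrix size and non-zero count here are made up purely for illustration, and exact sizes will vary by R version):

library(Matrix)
set.seed(1)
# dense storage keeps every cell, zeros included (8 bytes each)
dense <- matrix(0, nrow = 1000, ncol = 1000)
dense[sample(length(dense), 500)] <- 1  # only 500 non-zero entries
# sparse storage keeps just the non-zero values and their positions
sparse <- Matrix(dense, sparse = TRUE)
print(object.size(dense), units = "Mb")   # roughly 7.6 Mb
print(object.size(sparse), units = "Kb")  # only a few Kb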
Let’s start with a simple data set called some_dataframe:
print(some_dataframe)
## c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 outcome
## 1 2 7 0 0 0 0 0 0 0 0 0
## 2 0 0 3 0 0 0 0 0 0 0 0
## 3 0 0 0 6 1 0 0 0 0 0 0
## 4 0 0 0 2 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 12 0 1
## 6 0 0 0 0 0 25 0 0 0 0 1
## 7 1 0 0 0 2 0 0 0 0 0 0
## 8 0 0 0 2 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 14 0 1
## 10 0 0 0 0 0 21 0 0 0 0 1
## 11 0 0 0 0 0 0 28 0 0 0 1
## 12 0 0 0 0 0 0 0 35 0 0 1
## 13 0 0 0 0 0 0 0 0 42 0 1
## 14 0 0 0 0 0 0 0 0 0 49 1
To see the above data.frame in a matrix format simply requires casting it to a matrix:
some_matrix <- data.matrix(some_dataframe[1:10])
print(some_matrix)
## c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
## [1,] 2 7 0 0 0 0 0 0 0 0
## [2,] 0 0 3 0 0 0 0 0 0 0
## [3,] 0 0 0 6 1 0 0 0 0 0
## [4,] 0 0 0 2 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0 12 0
## [6,] 0 0 0 0 0 25 0 0 0 0
## [7,] 1 0 0 0 2 0 0 0 0 0
## [8,] 0 0 0 2 0 0 0 0 0 0
## [9,] 0 0 0 0 0 0 0 0 14 0
## [10,] 0 0 0 0 0 21 0 0 0 0
## [11,] 0 0 0 0 0 0 28 0 0 0
## [12,] 0 0 0 0 0 0 0 35 0 0
## [13,] 0 0 0 0 0 0 0 0 42 0
## [14,] 0 0 0 0 0 0 0 0 0 49
Visually, it isn’t much different from the data frame (though, internally, a matrix is restricted to a single data type). To transform it into a sparse matrix, we load the {Matrix} library and call the Matrix function with the sparse flag set to TRUE:
library(Matrix)
print(Matrix(some_matrix, sparse=TRUE))
## 14 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'c1', 'c2', 'c3' ... ]]
##
## [1,] 2 7 . . . . . . . .
## [2,] . . 3 . . . . . . .
## [3,] . . . 6 1 . . . . .
## [4,] . . . 2 . . . . . .
## [5,] . . . . . . . . 12 .
## [6,] . . . . . 25 . . . .
## [7,] 1 . . . 2 . . . . .
## [8,] . . . 2 . . . . . .
## [9,] . . . . . . . . 14 .
## [10,] . . . . . 21 . . . .
## [11,] . . . . . . 28 . . .
## [12,] . . . . . . . 35 . .
## [13,] . . . . . . . . 42 .
## [14,] . . . . . . . . . 49
And here we finally get a sense of its efficiency: it only retains the non-zero values!
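Under the hood, the dgCMatrix class stores the data in compressed sparse column form. A quick peek at its slots (a small aside, not part of the original walkthrough) confirms that only the non-zero entries and their positions are kept:

sm <- Matrix(some_matrix, sparse = TRUE)
str(sm@x)  # the non-zero values themselves
str(sm@i)  # zero-based row indices of those values
str(sm@p)  # pointers marking where each column starts in @x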
The {glmnet} package
The help files describe the {glmnet} package as containing ‘extremely efficient procedures for fitting lasso or elastic-net regularization for linear regression, logistic and multinomial regression models, poisson regression and the Cox model’, and more.
Unfortunately, few models in R besides glmnet support sparse matrices (that I know of); therefore, in conversations about modeling with R, when the subject of sparse matrices comes up, it is usually followed by the glmnet model.
Let’s start by splitting our some_dataframe in two parts: a 70% portion that will become our training data set and a 30% portion for our testing data set (always set the seed so random draws are reproducible):
set.seed(2)
split <- sample(nrow(some_dataframe), floor(0.7*nrow(some_dataframe)))
train <- some_dataframe[split,]
test <- some_dataframe[-split,]
We then construct a sparse model matrix using the typical R formula:
library(glmnet)
train_sparse <- sparse.model.matrix(~.,train[1:10])
test_sparse <- sparse.model.matrix(~.,test[1:10])
print(train_sparse)
## 9 x 11 sparse Matrix of class "dgCMatrix"
## [[ suppressing 11 column names '(Intercept)', 'c1', 'c2' ... ]]
##
## 3 1 . . . 6 1 . . . . .
## 10 1 . . . . . 21 . . . .
## 7 1 1 . . . 2 . . . . .
## 2 1 . . 3 . . . . . . .
## 13 1 . . . . . . . . 42 .
## 9 1 . . . . . . . . 14 .
## 11 1 . . . . . . 28 . . .
## 6 1 . . . . . 25 . . . .
## 14 1 . . . . . . . . . 49
print(test_sparse)
## 5 x 11 sparse Matrix of class "dgCMatrix"
## [[ suppressing 11 column names '(Intercept)', 'c1', 'c2' ... ]]
##
## 1 1 2 7 . . . . . . . .
## 4 1 . . . 2 . . . . . .
## 5 1 . . . . . . . . 12 .
## 8 1 . . . 2 . . . . . .
## 12 1 . . . . . . . 35 . .
We call the glmnet model and get our fit:
fit <- glmnet(train_sparse,train[,11])
And call the predict function to get our predictions:
pred <- predict(fit, test_sparse, type="response")
print(head(pred[,1:5]))
## s0 s1 s2 s3 s4
## 1 0.6667 0.6742 0.6815 0.6884 0.695
## 4 0.6667 0.6742 0.6815 0.6884 0.695
## 5 0.6667 0.6742 0.6815 0.6884 0.695
## 8 0.6667 0.6742 0.6815 0.6884 0.695
## 12 0.6667 0.6742 0.6815 0.6884 0.695
The reason it returns many prediction sets is that glmnet fits the model for many regularization parameters at the same time. To help us choose the best prediction set, we can use the function cv.glmnet. This uses cross-validation to find the fit with the smallest error. Let’s call cv.glmnet and pass its result to the s parameter (the penalty parameter) of the predict function:
# use cv.glmnet to find best lambda/penalty - choosing small nfolds for cv due to…
# s is the penalty parameter
cv <- cv.glmnet(train_sparse,train[,11],nfolds=3)
pred <- predict(fit, test_sparse,type="response", s=cv$lambda.min)
NOTE: cv.glmnet returns various values that may be important for your modeling needs, in particular lambda.min and lambda.1se. The first is the penalty giving the smallest cross-validated error; the second is the largest penalty whose error is still within one standard error of that minimum (i.e., the simplest model that performs comparably). Refer to the help files to find the best one for your needs.
print(names(cv))
## [1] "lambda" "cvm" "cvsd" "cvup" "cvlo"
## [6] "nzero" "name" "glmnet.fit" "lambda.min" "lambda.1se"
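For instance, here is a small sketch (reusing the fit and cv objects from above; pred_1se is just an illustrative name) that inspects both penalties and predicts with the more conservative one:

print(cv$lambda.min)  # penalty with the lowest mean cross-validated error
print(cv$lambda.1se)  # largest penalty within one standard error of that minimum
pred_1se <- predict(fit, test_sparse, type="response", s=cv$lambda.1se)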
As the data is made up, let’s not read too deeply into these predictions:
print(pred)
## 1
## 1 0.9898
## 4 0.8306
## 5 0.9898
## 8 0.8306
## 12 0.9898
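One aside before moving on (this variant isn’t in the original code): since the outcome here is binary, you could also fit an explicit logistic model with family="binomial", in which case type="response" returns genuine class probabilities rather than Gaussian fitted values. A minimal sketch, with fit_logit and prob as illustrative names:

# hedged variant: an explicit logistic fit, so predictions are
# true class probabilities rather than Gaussian fitted values
fit_logit <- glmnet(train_sparse, train[,11], family="binomial")
prob <- predict(fit_logit, test_sparse, type="response")
print(head(prob[,1:5]))  # one column per penalty, values now in (0,1)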
Let’s see how the sparse.model.matrix function of the {Matrix} package handles discrete values. I added a factor variable mood with two levels: happy and sad:
print(cat_dataframe)
## c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 mood outcome
## 1 2 7 0 0 0 0 0 0 0 0 happy 0
## 2 0 0 3 0 0 0 0 0 0 0 happy 0
## 3 0 0 0 6 1 0 0 0 0 0 happy 0
## 4 0 0 0 2 0 0 0 0 0 0 happy 0
## 5 0 0 0 0 0 0 0 0 12 0 sad 1
## 6 0 0 0 0 0 25 0 0 0 0 sad 1
## 7 1 0 0 0 2 0 0 0 0 0 happy 0
## 8 0 0 0 2 0 0 0 0 0 0 happy 0
## 9 0 0 0 0 0 0 0 0 14 0 sad 1
## 10 0 0 0 0 0 21 0 0 0 0 sad 1
## 11 0 0 0 0 0 0 28 0 0 0 sad 1
## 12 0 0 0 0 0 0 0 35 0 0 sad 1
## 13 0 0 0 0 0 0 0 0 42 0 sad 1
## 14 0 0 0 0 0 0 0 0 0 49 sad 1
We call the sparse.model.matrix function and notice that it turned happy into 0’s and sad into 1’s (happy being the reference level). Since a sparse matrix only displays non-zero entries, we only see the 1’s:
sparse.model.matrix(~.,cat_dataframe)
## 14 x 13 sparse Matrix of class "dgCMatrix"
## [[ suppressing 13 column names '(Intercept)', 'c1', 'c2' ... ]]
##
## 1 1 2 7 . . . . . . . . . .
## 2 1 . . 3 . . . . . . . . .
## 3 1 . . . 6 1 . . . . . . .
## 4 1 . . . 2 . . . . . . . .
## 5 1 . . . . . . . . 12 . 1 1
## 6 1 . . . . . 25 . . . . 1 1
## 7 1 1 . . . 2 . . . . . . .
## 8 1 . . . 2 . . . . . . . .
## 9 1 . . . . . . . . 14 . 1 1
## 10 1 . . . . . 21 . . . . 1 1
## 11 1 . . . . . . 28 . . . 1 1
## 12 1 . . . . . . . 35 . . 1 1
## 13 1 . . . . . . . . 42 . 1 1
## 14 1 . . . . . . . . . 49 1 1
Let’s complicate things by adding a few more levels to our factor:
print(cat_dataframe)
## c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 mood outcome
## 1 2 7 0 0 0 0 0 0 0 0 angry 0
## 2 0 0 3 0 0 0 0 0 0 0 neutral 0
## 3 0 0 0 6 1 0 0 0 0 0 happy 0
## 4 0 0 0 2 0 0 0 0 0 0 happy 0
## 5 0 0 0 0 0 0 0 0 12 0 sad 1
## 6 0 0 0 0 0 25 0 0 0 0 sad 1
## 7 1 0 0 0 2 0 0 0 0 0 happy 0
## 8 0 0 0 2 0 0 0 0 0 0 happy 0
## 9 0 0 0 0 0 0 0 0 14 0 sad 1
## 10 0 0 0 0 0 21 0 0 0 0 neutral 1
## 11 0 0 0 0 0 0 28 0 0 0 sad 1
## 12 0 0 0 0 0 0 0 35 0 0 sad 1
## 13 0 0 0 0 0 0 0 0 42 0 sad 1
## 14 0 0 0 0 0 0 0 0 0 49 sad 1
print(levels(cat_dataframe$mood))
## [1] "angry" "happy" "neutral" "sad"
We can compare the dimensions of both data sets:
dim(cat_dataframe)
## [1] 14 12
dim(sparse.model.matrix(~.,cat_dataframe))
## [1] 14 15
The sparse model matrix has 3 extra columns: it broke the 4 levels out into 3 indicator columns. This is because it applied full-rank coding to the set - you’re either one of those 3 moods, and if you’re none of the 3, then you’re assumed to be the 4th, angry (the reference level). Check out my walkthrough on dummy variables for more details:
print(sparse.model.matrix(~.,cat_dataframe))
## 14 x 15 sparse Matrix of class "dgCMatrix"
## [[ suppressing 15 column names '(Intercept)', 'c1', 'c2' ... ]]
##
## 1 1 2 7 . . . . . . . . . . . .
## 2 1 . . 3 . . . . . . . . 1 . .
## 3 1 . . . 6 1 . . . . . 1 . . .
## 4 1 . . . 2 . . . . . . 1 . . .
## 5 1 . . . . . . . . 12 . . . 1 1
## 6 1 . . . . . 25 . . . . . . 1 1
## 7 1 1 . . . 2 . . . . . 1 . . .
## 8 1 . . . 2 . . . . . . 1 . . .
## 9 1 . . . . . . . . 14 . . . 1 1
## 10 1 . . . . . 21 . . . . . 1 . 1
## 11 1 . . . . . . 28 . . . . . 1 1
## 12 1 . . . . . . . 35 . . . . 1 1
## 13 1 . . . . . . . . 42 . . . 1 1
## 14 1 . . . . . . . . . 49 . . 1 1
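If you would rather keep one indicator column per level (one-hot encoding instead of full-rank coding), sparse.model.matrix accepts a contrasts.arg override; a hedged sketch, with m_onehot as an illustrative name:

# override the default treatment contrasts for mood so that all four
# levels get their own indicator column (no reference level dropped)
m_onehot <- sparse.model.matrix(~., cat_dataframe,
    contrasts.arg = list(mood = contrasts(cat_dataframe$mood, contrasts = FALSE)))
dim(m_onehot)  # 14 x 16: intercept, c1-c10, four mood columns, outcome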
Full source code (also on GitHub):
some_dataframe <- read.table(text="c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 outcome
2 7 0 0 0 0 0 0 0 0 0
0 0 3 0 0 0 0 0 0 0 0
0 0 0 6 1 0 0 0 0 0 0
0 0 0 2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 12 0 1
0 0 0 0 0 25 0 0 0 0 1
1 0 0 0 2 0 0 0 0 0 0
0 0 0 2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 14 0 1
0 0 0 0 0 21 0 0 0 0 1
0 0 0 0 0 0 28 0 0 0 1
0 0 0 0 0 0 0 35 0 0 1
0 0 0 0 0 0 0 0 42 0 1
0 0 0 0 0 0 0 0 0 49 1", header=T, sep="")
library(Matrix)
some_matrix <- data.matrix(some_dataframe[1:10])
# show matrix representation of data set
Matrix(some_matrix, sparse=TRUE)
# split data set into a train and test portion
set.seed(2)
split <- sample(nrow(some_dataframe), floor(0.7*nrow(some_dataframe)))
train <- some_dataframe[split,]
test <- some_dataframe[-split,]
# transform both sets into sparse matrices using sparse.model.matrix
train_sparse <- sparse.model.matrix(~.,train[1:10])
test_sparse <- sparse.model.matrix(~.,test[1:10])
# model the sparse sets using glmnet
library(glmnet)
fit <- glmnet(train_sparse,train[,11])
# use cv.glmnet to find best lambda/penalty
# s is the penalty parameter
cv <- cv.glmnet(train_sparse,train[,11],nfolds=3)
pred <- predict(fit, test_sparse,type="response", s=cv$lambda.min)
# receiver operating characteristic (ROC curves)
library(pROC)
auc <- roc(test[,11], as.numeric(pred))  # roc expects a numeric vector predictor
print(auc$auc)
# how does sparse deal with categorical data (adding mood feature with two levels)?
cat_dataframe<- read.table(text="c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 mood outcome
2 7 0 0 0 0 0 0 0 0 happy 0
0 0 3 0 0 0 0 0 0 0 happy 0
0 0 0 6 1 0 0 0 0 0 happy 0
0 0 0 2 0 0 0 0 0 0 happy 0
0 0 0 0 0 0 0 0 12 0 sad 1
0 0 0 0 0 25 0 0 0 0 sad 1
1 0 0 0 2 0 0 0 0 0 happy 0
0 0 0 2 0 0 0 0 0 0 happy 0
0 0 0 0 0 0 0 0 14 0 sad 1
0 0 0 0 0 21 0 0 0 0 sad 1
0 0 0 0 0 0 28 0 0 0 sad 1
0 0 0 0 0 0 0 35 0 0 sad 1
0 0 0 0 0 0 0 0 42 0 sad 1
0 0 0 0 0 0 0 0 0 49 sad 1", header=T, sep="")
print(sparse.model.matrix(~.,cat_dataframe))
# increasing the number of levels in the mood variable
cat_dataframe <- read.table(text="c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 mood outcome
2 7 0 0 0 0 0 0 0 0 angry 0
0 0 3 0 0 0 0 0 0 0 neutral 0
0 0 0 6 1 0 0 0 0 0 happy 0
0 0 0 2 0 0 0 0 0 0 happy 0
0 0 0 0 0 0 0 0 12 0 sad 1
0 0 0 0 0 25 0 0 0 0 sad 1
1 0 0 0 2 0 0 0 0 0 happy 0
0 0 0 2 0 0 0 0 0 0 happy 0
0 0 0 0 0 0 0 0 14 0 sad 1
0 0 0 0 0 21 0 0 0 0 neutral 1
0 0 0 0 0 0 28 0 0 0 sad 1
0 0 0 0 0 0 0 35 0 0 sad 1
0 0 0 0 0 0 0 0 42 0 sad 1
0 0 0 0 0 0 0 0 0 49 sad 1", header=T, sep="")
print(levels(cat_dataframe$mood))
dim(cat_dataframe)
# sparse.model.matrix added extra columns when it binarized mood
dim(sparse.model.matrix(~.,cat_dataframe))
Manuel Amunategui - Follow me on Twitter: @amunategui