Using String Distance {stringdist} To Handle Large Text Factors, Cluster Them Into Supersets

Practical walkthroughs on machine learning, data exploration and finding insight.

Resources


Packages Used in this Walkthrough



If you’re wondering whether you’re getting the most out of a text-based, factor variable from a large data set, then you’re not alone. There are so many ways of deconstructing text variables. If every entry is made of repeated text from a small set of possibilities, then dummifying it is the easiest way to proceed. On the other hand, if every entry is unique, then resorting to Natural Language Processing (NLP) may be required. This article tackles the gray area in between, where the data is neither unique nor small, where dummifying won’t work yet NLP may still be avoided.

So that we are on the same page, imagine a data set with 10 million rows with at least one feature/column being a text-based factor. It is not made up of free-text, where every entry is unique, instead, it is made up of repeated text: for example 10,000 possibilities repeated over 10 million rows. This would be hard to dummify, as it will blow up your feature space, and would take forever to group by hand.

What Is One To Do?



Grouping With {stringdist}

Could those 10,000 possibilities mentioned earlier be grouped into a superset representing only a tenth or a fifth of its original size? What is close to impossible to do by hand is trivial with string distance:

...a metric that measures distance ("inverse similarity") between two text strings for approximate string matching or comparison and in fuzzy string searching. (Source: Wikipedia)

The {strndist} package offers ‘Approximate string matching and string distance functions’. It offers many algorithms but the two I found the most interesting for short sets of words are:

...the Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings. The higher the Jaro–Winkler distance for two strings is, the more similar the strings are. The Jaro–Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match. (Source: Wikipedia)

and

...the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. (Source: Wikipedia)



Let’s Code!

Enough chitchat, let’s download the vehicles data set from Hadley Wickham hosted on Github. It is a big and diverse data set, perfect for our needs:

library(RCurl)
urlfile <-'https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv'
x <- getURL(urlfile, ssl.verifypeer = FALSE)
vehicles <- read.csv(textConnection(x))

# alternative way of getting the data if the above snippet doesn't work:
# urlData <- getURL('https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv')
# vehicles <- read.csv(text = urlData)



We’re going to focus on one single feature in the data set: model. Let’s start with some basic statistics on that feature to understand what we’re dealing with:

nrow(vehicles)
[1] 34631
length(unique(vehicles$model))
[1] 3234



So, we have a data set of over 34,631 vehicles, but model is comprised of only 3,234 unique model names repeated through out all the observations/rows. Let’s look at the first 100 rows so we don’t get overwhelmed with too much the data:

vehicles_small <- vehicles[1:100,]



Out of those 100 observations, model has only 45 unique model names and here is a small sample of what it holds:

length(unique(vehicles_small$model))
[1] 45
head(unique(as.character(vehicles_small$model)))
[1] "Spider Veloce 2000"  "Testarossa"          "Charger"            
[4] "B150/B250 Wagon 2WD" "Legacy AWD Turbo"    "Loyale"



Let’s run some basic string distance on this subset by calling the stringdistmatrix function to see how it can help us tame the model variable into smaller supersets. In its simplest form, the function stringdistmatrix only requires a unique set of text values and the method to cluster the data:

stringdistmatrix(a, b, method = c("osa", "lv", "dl", "hamming", "lcs",
        "qgram", "cosine", "jaccard", "jw", useBytes = FALSE,
        weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0,
        useNames = FALSE, ncores = 1, cluster = NULL)



We’ll pass it the unique list of models, request the Jaro–Winkler distance algorithm (my favorite for this task), cluster the results into 20 groups with the hclust function and plot the resulting dendrogram:

library(stringdist)
uniquemodels <- unique(as.character(vehicles_small$model))
distancemodels <- stringdistmatrix(uniquemodels,uniquemodels,method = "jw")
rownames(distancemodels) <- uniquemodels
hc <- hclust(as.dist(distancemodels))
plot(hc)
rect.hclust(hc,k=20)

plot of chunk unnamed-chunk-6

stringdistmatrix works in tandem with hclust, one creates the model, the other enforces the clusters. Our algorithm created a Wagon and Taurus group along with some number groups. So far it isn’t extremely impressive as its only using a small subset of data - wait till we open things up!

We now look at a bigger subset of the vehicles data set. Let’s pull the first 2000 observations:

vehicles_small <- vehicles[1:2000,]
length(unique(vehicles_small$model))
[1] 481



Out of those 2000 observations, we have 481 unique model names. Let’s ask stringdistmatrix to group those into 200 groups:

uniquemodels <- unique(as.character(vehicles_small$model))
distancemodels <- stringdistmatrix(uniquemodels,uniquemodels,method = "jw")
rownames(distancemodels) <- uniquemodels
hc <- hclust(as.dist(distancemodels))
dfClust <- data.frame(uniquemodels, cutree(hc, k=200))
names(dfClust) <- c('modelname','cluster')



Let’s visualize the quantities of models in each group created by the Jaro–Winkler distance algorithm:

plot(table(dfClust$cluster))

plot of chunk unnamed-chunk-9

print(paste('Average number of models per cluster:', mean(table(dfClust$cluster))))
## [1] "Average number of models per cluster: 2.405"



The largest cluster contains over 10 models but the average is 2.4 models per cluster. Now, lets look at the top groups and see what the algorithm did (don’t sweat this code, it simply orders the data by cluster size):

t <- table(dfClust$cluster)
t <- cbind(t,t/length(dfClust$cluster))
t <- t[order(t[,2], decreasing=TRUE),]
p <- data.frame(factorName=rownames(t), binCount=t[,1], percentFound=t[,2])
dfClust <- merge(x=dfClust, y=p, by.x = 'cluster', by.y='factorName', all.x=T)
dfClust <- dfClust[rev(order(dfClust$binCount)),]
names(dfClust) <-  c('cluster','modelname')
head (dfClust[c('cluster','modelname')],50)
##     cluster              modelname
## 192      73       K1500 Pickup 4WD
## 191      73         S10 Pickup 2WD
## 190      73        W250 Pickup 4WD
## 189      73        F150 Pickup 2WD
## 188      73         S10 Pickup 4WD
## 187      73   D100/D150 Pickup 2WD
## 186      73        F250 Pickup 2WD
## 185      73       C1500 Pickup 2WD
## 184      73        F150 Pickup 4WD
## 183      73        D250 Pickup 2WD
## 182      73   W100/W150 Pickup 4WD
## 341     123 Postal Cab Chassis 2WD
## 340     123    S10 Cab Chassis 2WD
## 339     123 Dakota Cab Chassis 2WD
## 338     123        Cab/Chassis 2WD
## 337     123        Cab Chassis 2WD
## 336     123    S15 Cab Chassis 2WD
## 335     123  Truck Cab Chassis 2WD
## 334     123   D250 Cab Chassis 2WD
## 236      84         Yukon 1500 4WD
## 235      84       Suburban C10 2WD
## 234      84            SJ 410V 4WD
## 233      84      Suburban 1500 2WD
## 232      84       Suburban K10 4WD
## 231      84        Yukon K1500 4WD
## 230      84             SJ 410 4WD
## 229      84      Suburban 1500 4WD
## 365     130        900 Convertible
## 364     130       318i Convertible
## 363     130            Convertible
## 362     130    XJS V12 Convertible
## 361     130       E320 Convertible
## 360     130       325i Convertible
## 359     130        XJS Convertible
## 307     107     Sidekick 2Door 2WD
## 306     107   Sidekick Hardtop 2WD
## 305     107     Sidekick 4Door 2WD
## 304     107     Sidekick 2Door 4WD
## 303     107           Sidekick 2WD
## 302     107   Sidekick Hardtop 4WD
## 301     107     Sidekick 4Door 4WD
## 86       36              240 Wagon
## 85       36             E320 Wagon
## 84       36              940 Wagon
## 83       36              850 Wagon
## 82       36              960 Wagon
## 81       36        E150 Club Wagon
## 80       36              100 Wagon
## 13        5       Legacy AWD Turbo
## 12        5           Legacy Wagon



Out of the 200 clusters we requested, cluster 73 is the largest holding 11 models. Clearly, it picked up on the word pickup flanked by two words on either side with the right one being 2WD or 4WD. Cluster 123 looked for Cab Chassis, even picking up a Cab/Chassis in the process. You get the idea and, hopefully, are impressed with how a few lines of code reduced 2000 observations into 200 groups. The exact same process would apply to 20,000 observations or 20 million…

Creating New Variables Through Combining Multiple Features

An offshoot of this process is to create new groups by combining existing features and running the results through stringdistmatrix. Let’s try combining model with trany:

vehicles_small$modelAndTrany <- paste0(as.character(vehicles_small$model)," ",as.character(vehicles_small$trany))
print(length(unique(vehicles_small$modelAndTrany)))
## [1] 808



Our new field has 808 unique values out of our 2000 small_vehicles data frame. Let’s run it through the Jaro–Winkler distance algorithm, request 500 clusters and check out the top groups:

uniquemodels <- unique(as.character(vehicles_small$modelAndTrany))
distancemodels <- stringdistmatrix(uniquemodels,uniquemodels,method = "jw")
rownames(distancemodels) <- uniquemodels
hc <- hclust(as.dist(distancemodels))
dfClust <- data.frame(uniquemodels, cutree(hc, k=500))
names(dfClust) <- c('modelname','cluster')
t <- table(dfClust$cluster)
t <- cbind(t,t/length(dfClust$cluster))
t <- t[order(t[,2], decreasing=TRUE),]
p <- data.frame(factorName=rownames(t), binCount=t[,1], percentFound=t[,2])
dfClust <- merge(x=dfClust, y=p, by.x = 'cluster', by.y='factorName', all.x=T)
dfClust <- dfClust[rev(order(dfClust$binCount)),]
names(dfClust) <-  c('cluster','modelname')
head (dfClust[c('cluster','modelname')],50)
##     cluster                           modelname
## 38       16                 960 Automatic 4-spd
## 37       16                  90 Automatic 4-spd
## 36       16                 940 Automatic 4-spd
## 35       16                 900 Automatic 4-spd
## 34       16                E500 Automatic 4-spd
## 33       16                 100 Automatic 4-spd
## 32       16                9000 Automatic 4-spd
## 31       16                 850 Automatic 4-spd
## 27       14                 G20 Automatic 4-spd
## 26       14                 240 Automatic 4-spd
## 25       14                S420 Automatic 4-spd
## 24       14               240SX Automatic 4-spd
## 23       14                C280 Automatic 4-spd
## 22       14                C220 Automatic 4-spd
## 21       14                E420 Automatic 4-spd
## 221     120           960 Wagon Automatic 4-spd
## 220     120          E320 Wagon Automatic 4-spd
## 219     120           850 Wagon Automatic 4-spd
## 218     120           240 Wagon Automatic 4-spd
## 217     120           100 Wagon Automatic 4-spd
## 216     120           940 Wagon Automatic 4-spd
## 185      99                     SW Manual 5-spd
## 184      99                     S4 Manual 5-spd
## 183      99                     S6 Manual 5-spd
## 182      99                    NSX Manual 5-spd
## 181      99                     SL Manual 5-spd
## 180      99                     SC Manual 5-spd
## 248     129    Ram 1500 Pickup 4WD Manual 5-spd
## 247     129    Ram 2500 Pickup 2WD Manual 5-spd
## 246     129    Ram 2500 Pickup 4WD Manual 5-spd
## 245     129    Ram 1500 Pickup 2WD Manual 5-spd
## 244     129      Ram 50 Pickup 2WD Manual 5-spd
## 243     128 Ram 1500 Pickup 2WD Automatic 4-spd
## 242     128 Ram 2500 Pickup 4WD Automatic 4-spd
## 241     128   Ram 50 Pickup 2WD Automatic 4-spd
## 240     128 Ram 1500 Pickup 4WD Automatic 4-spd
## 239     128 Ram 2500 Pickup 2WD Automatic 4-spd
## 177      97                 NSX Automatic 4-spd
## 176      97                  SC Automatic 4-spd
## 175      97                 SVX Automatic 4-spd
## 174      97                  SW Automatic 4-spd
## 173      97                  SL Automatic 4-spd
## 154      83               SL600 Automatic 4-spd
## 153      83              500SEL Automatic 4-spd
## 152      83               SL500 Automatic 4-spd
## 151      83              400SEL Automatic 4-spd
## 150      83               500SL Automatic 4-spd
## 47       18                540i Automatic 5-spd
## 46       18               840ci Automatic 5-spd
## 45       18               740il Automatic 5-spd



Conclusion

stringdistmatrix is a very flexible function with many tunable features. The cluster size, the algorithm, the concatenation of text with text and/or numbers create numerous and mind-boggling possibilities. Even with all these settings it is still so much easier than creating supersets by hand! String distance matching is case sensitive so you may get better groups if you force all text to once case. Have fun with this…




Full source code (also on GitHub):


# get the Hadley Wickham's vehicles data set
library(RCurl)
urlfile <-'https://raw.githubusercontent.com/hadley/fueleconomy/master/data-raw/vehicles.csv'
x <- getURL(urlfile, ssl.verifypeer = FALSE)
vehicles <- read.csv(textConnection(x))

# size the data
nrow(vehicles)
length(unique(vehicles$model))

# get a small sample for starters
vehicles_small <- vehicles[1:100,]
length(unique(vehicles_small$model))
head(unique(as.character(vehicles_small$model)))

# call the stringdistmatrix function and request 20 groups
library(stringdist)
uniquemodels <- unique(as.character(vehicles_small$model))
distancemodels <- stringdistmatrix(uniquemodels,uniquemodels,method = "jw")
rownames(distancemodels) <- uniquemodels
hc <- hclust(as.dist(distancemodels))

# visualize the dendrogram
plot(hc)
rect.hclust(hc,k=20)
 
# get a bigger sample
vehicles_small <- vehicles[1:2000,]
length(unique(vehicles_small$model))

# run the stringdistmatrix function and request 200 groups
uniquemodels <- unique(as.character(vehicles_small$model))
distancemodels <- stringdistmatrix(uniquemodels,uniquemodels,method = "jw")
rownames(distancemodels) <- uniquemodels
hc <- hclust(as.dist(distancemodels))
dfClust <- data.frame(uniquemodels, cutree(hc, k=200))
names(dfClust) <- c('modelname','cluster')

# visualize the groupings
plot(table(dfClust$cluster))
print(paste('Average number of models per cluster:', mean(table(dfClust$cluster))))

# lets look at the top groups and see what the algorithm did:
t <- table(dfClust$cluster)
t <- cbind(t,t/length(dfClust$cluster))
t <- t[order(t[,2], decreasing=TRUE),]
p <- data.frame(factorName=rownames(t), binCount=t[,1], percentFound=t[,2])
dfClust <- merge(x=dfClust, y=p, by.x = 'cluster', by.y='factorName', all.x=T)
dfClust <- dfClust[rev(order(dfClust$binCount)),]
names(dfClust) <-  c('cluster','modelname')
head (dfClust[c('cluster','modelname')],50)

# try combining fields together
vehicles_small$modelAndTrany <- paste0(as.character(vehicles_small$model)," ",as.character(vehicles_small$trany))
print(length(unique(vehicles_small$modelAndTrany)))

uniquemodels <- unique(as.character(vehicles_small$modelAndTrany))
distancemodels <- stringdistmatrix(uniquemodels,uniquemodels,method = "jw")
rownames(distancemodels) <- uniquemodels
hc <- hclust(as.dist(distancemodels))
dfClust <- data.frame(uniquemodels, cutree(hc, k=500))
names(dfClust) <- c('modelname','cluster')
t <- table(dfClust$cluster)
t <- cbind(t,t/length(dfClust$cluster))
t <- t[order(t[,2], decreasing=TRUE),]
p <- data.frame(factorName=rownames(t), binCount=t[,1], percentFound=t[,2])
dfClust <- merge(x=dfClust, y=p, by.x = 'cluster', by.y='factorName', all.x=T)
dfClust <- dfClust[rev(order(dfClust$binCount)),]
names(dfClust) <-  c('cluster','modelname')
head (dfClust[c('cluster','modelname')],50)

# build a convenient function to do all of the above
GroupFactorsTogether <- function(objData, variableName, clustersize=200, method='jw') {
        #      osa: Optimal string aligment, (restricted Damerau-Levenshtein distance).
        #      lv: Levenshtein distance (as in R's native adist).
        #      dl: Full Damerau-Levenshtein distance.
        #      hamming: Hamming distance (a and b must have same nr of characters).
        #      lcs: Longest common substring distance.
        #      qgram: q-gram distance.
        #      cosine: cosine distance between q-gram profiles
        #      jaccard: Jaccard distance between q-gram profiles
        #      jw: Jaro, or Jaro-Winker distance.
        #      soundex: Distance based on soundex encoding

        #       stringdistmatrix(a, b, method = c("osa", "lv", "dl", "hamming", "lcs",
        #               "qgram", "cosine", "jaccard", "jw", useBytes = FALSE,
        #               weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0,
        #               useNames = FALSE, ncores = 1, cluster = NULL)
        #               require(stringdist)

        str <- unique(as.character(objData[,variableName]))
        print(paste('Uniques:',length(str)))

        d <- stringdistmatrix(str,str,method = c(method))

        rownames(d) <- str
        hc <- hclust(as.dist(d))

        dfClust <- data.frame(str, cutree(hc, k=clustersize))

        plot(table(dfClust$'cutree.hc..k...k.'))

        most_populated_clusters <- dfClust[dfClust$'cutree.hc..k...k.' > 5,]
        names(most_populated_clusters) <- c('entry','cluster')

        # sort by most frequent
        t <- table(most_populated_clusters$cluster)
        t <- cbind(t,t/length(most_populated_clusters$cluster))
        t <- t[order(t[,2], decreasing=TRUE),]
        p <- data.frame(factorName=rownames(t), binCount=t[,1], percentFound=t[,2])
        most_populated_clusters <- merge(x=most_populated_clusters, y=p, by.x = 'cluster', by.y='factorName', all.x=T)
        most_populated_clusters <- most_populated_clusters[rev(order(most_populated_clusters$binCount)),]
        names(most_populated_clusters) <-  c('cluster','entry')
        return (most_populated_clusters[c('cluster','entry')])
}