Where Are Your Customers Coming From And Where Are They Going - Reporting On Complex Customer Behavior In Plain English With C5.0
At SpringML, we are known for our customer-facing, colorful, interactive and informative dashboards, and we are very proud of them. As you can imagine, it takes time to build those dashboards, but it also requires a lot of research to decide what to feed into them - what will be the most empowering and pertinent to the customer. In that spirit, we are constantly evaluating new techniques and new tools to better dive into the data and unearth more patterns and bottlenecks.
The C5.0 Model (C5.0 Decision Trees and Rule-Based Models) is one of those tools. It is a high-performing, tree-based classification model. At each step it picks the feature and split point that best separate the outcome classes, and it can express the result as a set of complex, non-linear rules describing which features, and at what levels, provide the most lift. It doesn’t even require dummy variables for your categorical data. On the contrary, the closer your features are to the spoken language, the more readable the output. With some prep work, we can automate the reporting of C5.0 results using a structure similar to the spoken language, something that anybody can understand - no statistics degree required!
Why Are Your Customers Churning?
We’ll use the built-in data set supplied with the C5.0 R library, called Customer Churn. As the name implies, the data contains customer information and usage records from a phone company, including whether the customer churned or not. Here is a brief sample of the distilled intelligence you can expect from C5.0. Though this is a toy data set, it would seem that heavy all-day use, international plans, and frequent customer service calls are patterns for churn:
Out of 489 cases, 82% churned when:
- total_night_minutes > 174
- total_eve_minutes > 241
- total_day_minutes > 224
- international_plan == yes
- total_day_charge < 38.25
- number_customer_service_calls > 3
This data set contains two data frames: a training set (3333 entries) and a testing set (1667 entries). We’ll use churnTrain to keep things simple and recode the outcome variable churn to be more readable:
# install.packages('C50')
library(C50)
data(churn)
churn_data <- churnTrain
outcome_name <- 'churn'
# make the outcome variable easier to read
churn_data[,outcome_name] <- as.factor(ifelse(churn_data[,outcome_name]=='yes','Does_Churn', 'Stays'))
str(churn_data)
## 'data.frame': 3333 obs. of 20 variables:
## $ state : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
## $ account_length : int 128 107 137 84 75 118 121 147 117 141 ...
## $ area_code : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
## $ international_plan : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
## $ voice_mail_plan : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
## $ number_vmail_messages : int 25 26 0 0 0 0 24 0 0 37 ...
## $ total_day_minutes : num 265 162 243 299 167 ...
## $ total_day_calls : int 110 123 114 71 113 98 88 79 97 84 ...
## $ total_day_charge : num 45.1 27.5 41.4 50.9 28.3 ...
## $ total_eve_minutes : num 197.4 195.5 121.2 61.9 148.3 ...
## $ total_eve_calls : int 99 103 110 88 122 101 108 94 80 111 ...
## $ total_eve_charge : num 16.78 16.62 10.3 5.26 12.61 ...
## $ total_night_minutes : num 245 254 163 197 187 ...
## $ total_night_calls : int 91 103 104 89 121 118 118 96 90 97 ...
## $ total_night_charge : num 11.01 11.45 7.32 8.86 8.41 ...
## $ total_intl_minutes : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
## $ total_intl_calls : int 3 3 5 7 3 6 7 6 4 5 ...
## $ total_intl_charge : num 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
## $ number_customer_service_calls: int 1 1 0 2 3 0 3 0 1 0 ...
## $ churn : Factor w/ 2 levels "Does_Churn","Stays": 2 2 2 2 2 2 2 2 2 2 ...
Straight up C5.0
Let’s first look at the default summary output from C5.0:
c50_model <- C5.0(x=churn_data[,setdiff(names(churn_data), outcome_name)], y=churn_data[,outcome_name])
summary(c50_model)
##
## Call:
## C5.0.default(x = churn_data[, setdiff(names(churn_data), outcome_name)],
## y = churn_data[, outcome_name], rules = FALSE)
##
##
## C5.0 [Release 2.07 GPL Edition] Tue Dec 27 10:01:02 2016
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 3333 cases (20 attributes) from undefined.data
##
## Decision tree:
##
## total_day_minutes > 264.4:
## :...voice_mail_plan = yes:
## : :...international_plan = no: Stays (45/1)
## : : international_plan = yes: Does_Churn (8/3)
## : voice_mail_plan = no:
## : :...total_eve_minutes > 187.7:
## : :...total_night_minutes > 126.9: Does_Churn (94/1)
## : : total_night_minutes <= 126.9:
## : : :...total_day_minutes <= 277: Stays (4)
## : : total_day_minutes > 277: Does_Churn (3)
## : total_eve_minutes <= 187.7:
## : :...total_eve_charge <= 12.26: Stays (15/1)
## : total_eve_charge > 12.26:
## : :...total_day_minutes <= 277:
## : :...total_night_minutes <= 224.8: Stays (13)
## : : total_night_minutes > 224.8: Does_Churn (5/1)
## : total_day_minutes > 277:
## : :...total_night_minutes > 151.9: Does_Churn (18)
## : total_night_minutes <= 151.9:
## : :...account_length <= 123: Stays (4)
## : account_length > 123: Does_Churn (2)
## total_day_minutes <= 264.4:
## :...number_customer_service_calls > 3:
## :...total_day_minutes <= 160.2:
## : :...total_eve_charge <= 19.83: Does_Churn (79/3)
## : : total_eve_charge > 19.83:
## : : :...total_day_minutes <= 120.5: Does_Churn (10)
## : : total_day_minutes > 120.5: Stays (13/3)
## : total_day_minutes > 160.2:
## : :...total_eve_charge > 12.05: Stays (130/24)
## : total_eve_charge <= 12.05:
## : :...total_eve_calls <= 125: Does_Churn (16/2)
## : total_eve_calls > 125: Stays (3)
## number_customer_service_calls <= 3:
## :...international_plan = yes:
## :...total_intl_calls <= 2: Does_Churn (51)
## : total_intl_calls > 2:
## : :...total_intl_minutes <= 13.1: Stays (173/7)
## : total_intl_minutes > 13.1: Does_Churn (43)
## international_plan = no:
## :...total_day_minutes <= 223.2: Stays (2221/60)
## total_day_minutes > 223.2:
## :...total_eve_charge <= 20.5: Stays (295/22)
## total_eve_charge > 20.5:
## :...voice_mail_plan = yes: Stays (20)
## voice_mail_plan = no:
## :...total_night_minutes > 174.2: Does_Churn (50/8)
## total_night_minutes <= 174.2:
## :...total_day_minutes <= 246.6: Stays (12)
## total_day_minutes > 246.6:
## :...total_day_charge <= 43.33: Does_Churn (4)
## total_day_charge > 43.33: Stays (2)
##
##
## Evaluation on training data (3333 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 27 136( 4.1%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 365 118 (a): class Does_Churn
## 18 2832 (b): class Stays
##
##
## Attribute usage:
##
## 100.00% total_day_minutes
## 93.67% number_customer_service_calls
## 87.73% international_plan
## 20.73% total_eve_charge
## 8.97% voice_mail_plan
## 8.01% total_intl_calls
## 6.48% total_intl_minutes
## 6.33% total_night_minutes
## 4.74% total_eve_minutes
## 0.57% total_eve_calls
## 0.18% account_length
## 0.18% total_day_charge
##
##
## Time: 0.0 secs
Note from C5.0: An Informal Tutorial: "For instance, the last leaf of the decision tree is compensated (174.6/24.8), for which n is 174.6 and m is 24.8. The value of n is the number of cases in the file hypothyroid.data that are mapped to this leaf, and m (if it appears) is the number of them that are classified incorrectly by the leaf. (A non-integral number of cases can arise because, when the value of an attribute in the tree is not known, C5.0 splits the case and sends a fraction down each branch.)"
C5.0: An Informal Tutorial
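To make the (n/m) notation concrete, here is a minimal base-R sketch that pulls the two numbers out of leaf labels copied from the tree above. The helper name parse_leaf is hypothetical and not part of the C50 package. It assumes the label has the form "class (n/m)" or "class (n)".

```r
# Hypothetical helper (not part of C50): split a leaf label such as
# "Stays (130/24)" into the predicted class, n (cases reaching the leaf)
# and m (cases the leaf misclassifies; omitted in the output when zero).
parse_leaf <- function(leaf) {
  label  <- trimws(sub("\\(.*$", "", leaf))            # text before "("
  inside <- sub("^.*\\((.*)\\).*$", "\\1", leaf)       # text between the parentheses
  counts <- as.numeric(strsplit(inside, "/", fixed = TRUE)[[1]])
  n <- counts[1]
  m <- if (length(counts) > 1) counts[2] else 0
  data.frame(class = label, n = n, m = m, accuracy = (n - m) / n)
}

# Two leaves taken from the decision tree printed above
leaves <- c("Stays (130/24)", "Does_Churn (51)")
print(do.call(rbind, lapply(leaves, parse_leaf)))
```

The first leaf covers 130 training cases and gets 24 of them wrong. The second covers 51 cases with no errors, which is why no m appears.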
Now, if we set rules to TRUE (see ?C5.0: "A logical: should the tree be decomposed into a rule-based model?"), we get the model as a list of rules, a more readable format that will help us get this information into a structured form:
c50_model <- C5.0(x=churn_data[,setdiff(names(churn_data), outcome_name)], y=churn_data[,outcome_name], rules = TRUE)
summary(c50_model)
##
## Call:
## C5.0.default(x = churn_data[, setdiff(names(churn_data), outcome_name)],
## y = churn_data[, outcome_name], rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Tue Dec 27 10:01:02 2016
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 3333 cases (20 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (60, lift 6.8)
## international_plan = yes
## total_intl_calls <= 2
## -> class Does_Churn [0.984]
##
## Rule 2: (57, lift 6.8)
## international_plan = yes
## total_intl_minutes > 13.1
## -> class Does_Churn [0.983]
##
## Rule 3: (32, lift 6.7)
## total_day_minutes <= 120.5
## number_customer_service_calls > 3
## -> class Does_Churn [0.971]
##
## Rule 4: (79/3, lift 6.6)
## total_day_minutes <= 160.2
## total_eve_charge <= 19.83
## number_customer_service_calls > 3
## -> class Does_Churn [0.951]
##
## Rule 5: (43/2, lift 6.4)
## international_plan = no
## voice_mail_plan = no
## total_day_minutes > 246.6
## total_eve_charge > 20.5
## -> class Does_Churn [0.933]
##
## Rule 6: (28/2, lift 6.2)
## total_day_minutes <= 264.4
## total_eve_calls <= 125
## total_eve_charge <= 12.05
## number_customer_service_calls > 3
## -> class Does_Churn [0.900]
##
## Rule 7: (78/8, lift 6.1)
## voice_mail_plan = no
## total_day_minutes > 223.2
## total_eve_charge > 20.5
## total_night_minutes > 174.2
## -> class Does_Churn [0.888]
##
## Rule 8: (114/24, lift 5.4)
## voice_mail_plan = no
## total_day_minutes > 223.2
## total_eve_charge > 20.5
## -> class Does_Churn [0.784]
##
## Rule 9: (152/58, lift 4.3)
## total_day_minutes > 223.2
## total_eve_charge > 20.5
## -> class Does_Churn [0.617]
##
## Rule 10: (211/84, lift 4.1)
## total_day_minutes > 264.4
## -> class Does_Churn [0.601]
##
## Rule 11: (2221/60, lift 1.1)
## international_plan = no
## total_day_minutes <= 223.2
## number_customer_service_calls <= 3
## -> class Stays [0.973]
##
## Rule 12: (768/20, lift 1.1)
## international_plan = no
## voice_mail_plan = yes
## number_customer_service_calls <= 3
## -> class Stays [0.973]
##
## Rule 13: (140/5, lift 1.1)
## account_length <= 123
## total_eve_minutes <= 187.7
## total_night_minutes <= 151.9
## -> class Stays [0.958]
##
## Rule 14: (45/1, lift 1.1)
## international_plan = no
## voice_mail_plan = yes
## total_day_minutes > 264.4
## -> class Stays [0.957]
##
## Rule 15: (1972/87, lift 1.1)
## total_day_minutes <= 264.4
## total_intl_minutes <= 13.1
## total_intl_calls > 2
## number_customer_service_calls <= 3
## -> class Stays [0.955]
##
## Rule 16: (197/9, lift 1.1)
## total_day_minutes > 120.5
## total_day_minutes <= 160.2
## total_eve_charge > 19.83
## -> class Stays [0.950]
##
## Rule 17: (155/10, lift 1.1)
## voice_mail_plan = no
## total_day_minutes <= 277
## total_night_minutes <= 126.9
## -> class Stays [0.930]
##
## Rule 18: (1675/185, lift 1.0)
## total_day_minutes > 160.2
## total_day_minutes <= 264.4
## total_eve_charge > 12.05
## -> class Stays [0.889]
##
## Rule 19: (434/49, lift 1.0)
## total_eve_charge <= 12.26
## -> class Stays [0.885]
##
## Default class: Stays
##
##
## Evaluation on training data (3333 cases):
##
## Rules
## ----------------
## No Errors
##
## 19 146( 4.4%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 371 112 (a): class Does_Churn
## 34 2816 (b): class Stays
##
##
## Attribute usage:
##
## 98.23% total_day_minutes
## 84.61% number_customer_service_calls
## 75.73% international_plan
## 71.83% total_eve_charge
## 60.97% total_intl_calls
## 60.88% total_intl_minutes
## 31.02% voice_mail_plan
## 10.11% total_night_minutes
## 4.20% account_length
## 4.20% total_eve_minutes
## 0.84% total_eve_calls
##
##
## Time: 0.0 secs
Let’s analyze one of the rules:
Rule x: (152/58, lift 4.3)
total_day_minutes > 223.2
total_eve_charge > 20.5
-> class Does_Churn [0.617]
This is Rule 9 from the output above. It states that 152 training cases satisfied both total_day_minutes > 223.2 and total_eve_charge > 20.5, and that 58 of them failed to lead to the stated outcome, churn==Does_Churn. The remaining 94 did churn. The value in brackets, 0.617, is the rule’s estimated accuracy, and the rule provided a lift of 4.3 in predicting churn.
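The bracketed accuracy and the lift can be reproduced by hand. The sketch below follows the definitions in C5.0: An Informal Tutorial: the accuracy is the Laplace ratio (n - m + 1) / (n + 2), and the lift is that accuracy divided by the relative frequency of the predicted class in the training set. The class frequency comes from the confusion matrix above (371 + 112 = 483 churners out of 3333 cases). The variable names are illustrative.

```r
# Rule: (152/58, lift 4.3) -> class Does_Churn [0.617]
n <- 152          # training cases covered by the rule
m <- 58           # covered cases that do not belong to the predicted class
n_class <- 483    # Does_Churn cases in the training data (371 + 112)
n_total <- 3333   # all training cases

# Laplace ratio: the rule's estimated accuracy
confidence <- (n - m + 1) / (n + 2)

# Lift: estimated accuracy relative to the base rate of the predicted class
lift <- confidence / (n_class / n_total)

print(round(confidence, 3))  # 0.617
print(round(lift, 2))        # 4.26
```

A lift of about 4.3 means that a customer matching this rule is roughly four times as likely to churn as a customer picked at random from the training data.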
Automating a Report
Though the output is interesting and insightful, we need this in a data-frame format so we can build it into our reporting pipeline. Let’s see what we have to work with:
head(c50_model$rules)
## [1] "id=\"See5/C5.0 2.07 GPL Edition 2016-12-27\"\nentries=\"1\"\nrules=\"19\" default=\"Stays\"\nconds=\"2\" cover=\"60\" ok=\"60\" lift=\"6.78932\" class=\"Does_Churn\"\ntype=\"1\" att=\"international_plan\" val=\"yes\"\ntype=\"2\" att=\"total_intl_calls\" cut=\"2\" result=\"<\"\nconds=\"2\" cover=\"57\" ok=\"57\" lift=\"6.78366\" class=\"Does_Churn\"\ntype=\"1\" att=\"international_plan\" val=\"yes\"\ntype=\"2\" att=\"total_intl_minutes\" cut=\"13.1\" result=\">\"\nconds=\"2\" cover=\"32\" ok=\"32\" lift=\"6.69766\" class=\"Does_Churn\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"120.5\" result=\"<\"\ntype=\"2\" att=\"number_customer_service_calls\" cut=\"3\" result=\">\"\nconds=\"3\" cover=\"79\" ok=\"76\" lift=\"6.55985\" class=\"Does_Churn\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"160.2\" result=\"<\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"19.83\" result=\"<\"\ntype=\"2\" att=\"number_customer_service_calls\" cut=\"3\" result=\">\"\nconds=\"4\" cover=\"43\" ok=\"41\" lift=\"6.44058\" class=\"Does_Churn\"\ntype=\"1\" att=\"international_plan\" val=\"no\"\ntype=\"1\" att=\"voice_mail_plan\" val=\"no\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"246.60001\" result=\">\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"20.5\" result=\">\"\nconds=\"4\" cover=\"28\" ok=\"26\" lift=\"6.21056\" class=\"Does_Churn\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"264.39999\" result=\"<\"\ntype=\"2\" att=\"total_eve_calls\" cut=\"125\" result=\"<\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"12.05\" result=\"<\"\ntype=\"2\" att=\"number_customer_service_calls\" cut=\"3\" result=\">\"\nconds=\"4\" cover=\"78\" ok=\"70\" lift=\"6.1243\" class=\"Does_Churn\"\ntype=\"1\" att=\"voice_mail_plan\" val=\"no\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"223.2\" result=\">\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"20.5\" result=\">\"\ntype=\"2\" att=\"total_night_minutes\" cut=\"174.2\" result=\">\"\nconds=\"3\" cover=\"114\" ok=\"90\" lift=\"5.41342\" 
class=\"Does_Churn\"\ntype=\"1\" att=\"voice_mail_plan\" val=\"no\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"223.2\" result=\">\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"20.5\" result=\">\"\nconds=\"2\" cover=\"152\" ok=\"94\" lift=\"4.25688\" class=\"Does_Churn\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"223.2\" result=\">\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"20.5\" result=\">\"\nconds=\"1\" cover=\"211\" ok=\"127\" lift=\"4.14685\" class=\"Does_Churn\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"264.39999\" result=\">\"\nconds=\"3\" cover=\"2221\" ok=\"2161\" lift=\"1.13738\" class=\"Stays\"\ntype=\"1\" att=\"international_plan\" val=\"no\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"223.2\" result=\"<\"\ntype=\"2\" att=\"number_customer_service_calls\" cut=\"3\" result=\"<\"\nconds=\"3\" cover=\"768\" ok=\"748\" lift=\"1.13758\" class=\"Stays\"\ntype=\"1\" att=\"international_plan\" val=\"no\"\ntype=\"1\" att=\"voice_mail_plan\" val=\"yes\"\ntype=\"2\" att=\"number_customer_service_calls\" cut=\"3\" result=\"<\"\nconds=\"3\" cover=\"140\" ok=\"135\" lift=\"1.12006\" class=\"Stays\"\ntype=\"2\" att=\"account_length\" cut=\"123\" result=\"<\"\ntype=\"2\" att=\"total_eve_minutes\" cut=\"187.7\" result=\"<\"\ntype=\"2\" att=\"total_night_minutes\" cut=\"151.89999\" result=\"<\"\nconds=\"3\" cover=\"45\" ok=\"44\" lift=\"1.11971\" class=\"Stays\"\ntype=\"1\" att=\"international_plan\" val=\"no\"\ntype=\"1\" att=\"voice_mail_plan\" val=\"yes\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"264.39999\" result=\">\"\nconds=\"4\" cover=\"1972\" ok=\"1885\" lift=\"1.11734\" class=\"Stays\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"264.39999\" result=\"<\"\ntype=\"2\" att=\"total_intl_minutes\" cut=\"13.1\" result=\"<\"\ntype=\"2\" att=\"total_intl_calls\" cut=\"2\" result=\">\"\ntype=\"2\" att=\"number_customer_service_calls\" cut=\"3\" result=\"<\"\nconds=\"3\" cover=\"197\" ok=\"188\" lift=\"1.11071\" class=\"Stays\"\ntype=\"2\" att=\"total_day_minutes\" 
cut=\"120.5\" result=\">\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"160.2\" result=\"<\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"19.83\" result=\">\"\nconds=\"3\" cover=\"155\" ok=\"145\" lift=\"1.08754\" class=\"Stays\"\ntype=\"1\" att=\"voice_mail_plan\" val=\"no\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"277\" result=\"<\"\ntype=\"2\" att=\"total_night_minutes\" cut=\"126.9\" result=\"<\"\nconds=\"3\" cover=\"1675\" ok=\"1490\" lift=\"1.03976\" class=\"Stays\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"160.2\" result=\">\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"264.39999\" result=\"<\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"12.05\" result=\">\"\nconds=\"1\" cover=\"434\" ok=\"385\" lift=\"1.03536\" class=\"Stays\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"12.26\" result=\"<\"\n"
Yes, it’s a mess! But don’t despair, we’ll strsplit our way out of this and make it all come together:
rule_munger <- capture.output(c50_model$rules , split = TRUE)
rule_munger <- strsplit(rule_munger,'\\\\n')
rule_munger <- gsub(x = rule_munger[[1]], pattern = '\\\\|\"', replacement = '')[-1]
head(rule_munger,6)
## [1] "entries=1"
## [2] "rules=19 default=Stays"
## [3] "conds=2 cover=60 ok=60 lift=6.78932 class=Does_Churn"
## [4] "type=1 att=international_plan val=yes"
## [5] "type=2 att=total_intl_calls cut=2 result=<"
## [6] "conds=2 cover=57 ok=57 lift=6.78366 class=Does_Churn"
Much better, right?
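As an aside, the same clean-up can probably be done without capture.output. Assuming the rules element is a single character string with real newline characters, as the head() output above suggests, splitting directly on "\n" gives the same lines. This is a sketch on a shortened copy of that string, not the approach used in the rest of this walkthrough.

```r
# Shortened copy of the raw rules string shown above (header plus first rule).
raw_rules <- paste0(
  "id=\"See5/C5.0 2.07 GPL Edition 2016-12-27\"\n",
  "entries=\"1\"\n",
  "rules=\"19\" default=\"Stays\"\n",
  "conds=\"2\" cover=\"60\" ok=\"60\" lift=\"6.78932\" class=\"Does_Churn\"\n",
  "type=\"1\" att=\"international_plan\" val=\"yes\"\n",
  "type=\"2\" att=\"total_intl_calls\" cut=\"2\" result=\"<\"\n"
)

# Split on real newlines, strip the double quotes, drop the id line
rule_lines <- strsplit(raw_rules, "\n", fixed = TRUE)[[1]]
rule_lines <- gsub("\"", "", rule_lines, fixed = TRUE)[-1]
print(rule_lines)
```

With a fitted model, raw_rules would be replaced by c50_model$rules.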
The output is a bit different, so let’s look at it again. It states that C5.0 found 19 rules in total. The first rule has 2 conditions, with a cover of 60 and ok of 60 (60 rows matched, all of them correctly, a 100% success rate), and a lift of 6.78932 for predicting churn==Does_Churn. The type represents the kind of condition: in this output, type=1 lines test a categorical value (val) and type=2 lines test a numeric threshold (cut and result). The conditions of the first rule are international_plan == yes and total_intl_calls <= 2. Note that the raw format writes result=< where summary() prints <=. There you have it, a rule that is practically readable.
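Each cleaned line is a list of key=value pairs separated by spaces, so it can be turned into a named vector and queried by key instead of by position. A minimal sketch, with a hypothetical helper name parse_fields, assuming no value contains a space or an equals sign (true for the lines shown here, not verified for every data set):

```r
# Hypothetical helper: turn "key=value key=value ..." into a named character vector.
parse_fields <- function(line) {
  pairs  <- strsplit(strsplit(line, " ", fixed = TRUE)[[1]], "=", fixed = TRUE)
  keys   <- vapply(pairs, `[`, character(1), 1)
  values <- vapply(pairs, `[`, character(1), 2)
  setNames(values, keys)
}

# A rule header line and a condition line copied from the output above
header <- parse_fields("conds=2 cover=60 ok=60 lift=6.78932 class=Does_Churn")
cond   <- parse_fields("type=2 att=total_intl_calls cut=2 result=<")

print(header[["class"]])                                            # "Does_Churn"
print(as.numeric(header[["ok"]]) / as.numeric(header[["cover"]]))   # 1
print(paste(cond[["att"]], cond[["result"]], cond[["cut"]]))        # "total_intl_calls < 2"
```

Looking fields up by name keeps working if a line carries its fields in a different order.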
Reporting Magic
To make this easier on all of us, we’re going to build a function called interesting_interactions that will format this automatically and return a compact data frame.
interesting_interactions <- function(the_data_frame, outcome_name) {
# install.packages(...) if missing
require(C50)
require(dplyr)
c5model <- C5.0(
x = the_data_frame[,setdiff(names(the_data_frame), outcome_name)],
y = the_data_frame[,outcome_name],
rules = TRUE
)
rule_munger <- capture.output(c5model$rules , split = TRUE)
rule_munger <- strsplit(rule_munger,'\\\\n')
rule_munger <- gsub(x = rule_munger[[1]], pattern = '\\\\|\"', replacement = '')[-1]
# extract results into data frame format
rule_count <- 0
conds_last <- 0
cover_last <- 0
ok_last <- 0
lift_last <- 0
class_last <- 0
rules <- c()
for (entry in rule_munger) {
# track only lines starting with conds or type - ignore rest
if (substr(entry,1,5) == 'conds' |
substr(entry,1,4) == 'type') {
if (substr(entry,1,5) == 'conds') {
rule_count <- rule_count + 1
conds_last <-
strsplit(x = strsplit(x = entry, split = " ")[[1]][1], split = '=')[[1]][2]
# cover is the number of training cases covered by the rule
cover_last <-
strsplit(x = strsplit(x = entry, split = " ")[[1]][2], split = '=')[[1]][2]
# ok is the number of covered cases that belong to the predicted class
ok_last <-
strsplit(x = strsplit(x = entry, split = " ")[[1]][3], split = '=')[[1]][2]
# lift is the rule's estimated accuracy divided by the relative
# frequency of the predicted class in the training data
lift_last <-
strsplit(x = strsplit(x = entry, split = " ")[[1]][4], split = '=')[[1]][2]
# class predicted by the rule
class_last <-
strsplit(x = strsplit(x = entry, split = " ")[[1]][5], split = '=')[[1]][2]
}
if (substr(entry,1,4) == 'type') {
# variable type
type_last <-
strsplit(x = strsplit(x = entry, split = " ")[[1]][1], split = '=')[[1]][2]
att_last <-
strsplit(x = strsplit(x = entry, split = " ")[[1]][2], split = '=')[[1]][2]
# sniff out optional parameters
elts_last <- ''
if (grepl(x = entry, pattern = 'elts')) {
elts_last <- strsplit(x = entry, split = "elts=")[[1]][2]
}
cut_last <- ''
if (grepl(x = entry, pattern = 'cut')) {
cut_last <-
strsplit(
x = strsplit(
x = entry, split = "cut="
)[[1]][2], split = ' '
)[[1]][1]
}
val_last <- ''
if (grepl(x = entry, pattern = 'val')) {
val_last <- strsplit(x = entry, split = "val=")[[1]][2]
}
result_last <- ''
if (grepl(x = entry, pattern = 'result')) {
result_last <- strsplit(x = entry, split = "result=")[[1]][2]
}
rules <- rbind(
rules, c(
rule_count,
conds_last,
cover_last,
ok_last,
lift_last,
type_last,
att_last,
elts_last,
result_last,
cut_last,
val_last,
class_last
)
)
}
}
}
if (!is.null(rules)) {
rules <- data.frame(rules)
# note: as assigned here, the 'cut' column ends up holding the comparison
# operator (< or >) and the 'result' column the numeric threshold
names(rules) <-
c(
'rule_number', 'conditions', 'cover', 'true_pos',
'lift', 'type', 'attribute', 'elts', 'cut', 'result',
'value', 'outcome'
)
rules[, 1:6] <- sapply(rules[, 1:6], as.character)
rules[, 1:6] <- sapply(rules[, 1:6], as.numeric)
if (length(unique(rules$rule_number)) > 0) {
rules %>% dplyr::arrange(desc(lift)) -> rules
}
}
return (rules)
}
Now that we’ve built this function, let’s take it for a spin and see what we can learn about our churning and non-churning customers.
results <- interesting_interactions(the_data_frame = churn_data, outcome_name = outcome_name)
print(head(results, 10))
## rule_number conditions cover true_pos lift type
## 1 1 2 60 60 6.78932 1
## 2 1 2 60 60 6.78932 2
## 3 2 2 57 57 6.78366 1
## 4 2 2 57 57 6.78366 2
## 5 3 2 32 32 6.69766 2
## 6 3 2 32 32 6.69766 2
## 7 4 3 79 76 6.55985 2
## 8 4 3 79 76 6.55985 2
## 9 4 3 79 76 6.55985 2
## 10 5 4 43 41 6.44058 1
## attribute elts cut result value outcome
## 1 international_plan yes Does_Churn
## 2 total_intl_calls < 2 Does_Churn
## 3 international_plan yes Does_Churn
## 4 total_intl_minutes > 13.1 Does_Churn
## 5 total_day_minutes < 120.5 Does_Churn
## 6 number_customer_service_calls > 3 Does_Churn
## 7 total_day_minutes < 160.2 Does_Churn
## 8 total_eve_charge < 19.83 Does_Churn
## 9 number_customer_service_calls > 3 Does_Churn
## 10 international_plan no Does_Churn
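The data frame holds one row per condition, so a rule with three conditions takes three rows. For a dashboard it can be handy to collapse that into one row per rule. The sketch below does this in base R on a small hand-copied sample of the output above (rules 1 and 4). The names results_sample and per_rule are illustrative, and the cut and result columns are filled exactly as they are printed above (operator under cut, threshold under result).

```r
# Hand-copied sample of the results printed above (rules 1 and 4 only)
results_sample <- data.frame(
  rule_number = c(1, 1, 4, 4, 4),
  cover       = c(60, 60, 79, 79, 79),
  true_pos    = c(60, 60, 76, 76, 76),
  attribute   = c("international_plan", "total_intl_calls", "total_day_minutes",
                  "total_eve_charge", "number_customer_service_calls"),
  cut         = c("", "<", "<", "<", ">"),
  result      = c("", "2", "160.2", "19.83", "3"),
  value       = c("yes", "", "", "", ""),
  outcome     = "Does_Churn",
  stringsAsFactors = FALSE
)

# One text fragment per condition: categorical tests use value,
# numeric tests use the operator and the threshold
results_sample$condition <- ifelse(
  results_sample$value != "",
  paste(results_sample$attribute, "==", results_sample$value),
  paste(results_sample$attribute, results_sample$cut, results_sample$result)
)

# Collapse to one row per rule and add the share of covered cases predicted correctly
per_rule <- aggregate(condition ~ rule_number + cover + true_pos + outcome,
                      data = results_sample, FUN = paste, collapse = " AND ")
per_rule$precision <- per_rule$true_pos / per_rule$cover
print(per_rule)
```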
We can take this one step further and build a report-generating function. Something that can peel the conditions apart and weave them into a sentence:
# rules report
print_rules <- function(rules_found, rulenum) {
print('')
print(paste0('Rule #', rulenum))
dplyr::filter(rules_found, rule_number == rulenum) -> pulled_rule
dplyr::select(pulled_rule, cover, true_pos, outcome) %>% head(1) -> rule_def
dplyr::select(pulled_rule, attribute, elts, cut, result, value) -> conditions
print(paste0('In ', rule_def$cover, ' cases, ', round(rule_def$true_pos/rule_def$cover,2)*100, '% customers ',
as.character(rule_def$outcome),' when:'))
for (cond_id in seq(nrow(conditions))) {
cond <- conditions[cond_id,]
#attribute elts cut result value
if (nchar(as.character(cond$elts)) > 0) {
print(paste0(cond$attribute,
': ', cond$elts))
} else if (nchar(as.character(cond$value)) > 0) {
print(paste0(cond$attribute,
' == ', cond$value))
} else {
print(paste0(cond$attribute, " ", cond$cut, " ", cond$result))
}
}
print('')
}
# collate interesting rules into report format - top x rules found sorted by lift
for (rule_number in unique(results$rule_number))
print_rules(results, rule_number)
## [1] ""
## [1] "Rule #1"
## [1] "In 60 cases, 100% customers Does_Churn when:"
## [1] "international_plan == yes"
## [1] "total_intl_calls < 2"
## [1] ""
## [1] ""
## [1] "Rule #2"
## [1] "In 57 cases, 100% customers Does_Churn when:"
## [1] "international_plan == yes"
## [1] "total_intl_minutes > 13.1"
## [1] ""
## [1] ""
## [1] "Rule #3"
## [1] "In 32 cases, 100% customers Does_Churn when:"
## [1] "total_day_minutes < 120.5"
## [1] "number_customer_service_calls > 3"
## [1] ""
## [1] ""
## [1] "Rule #4"
## [1] "In 79 cases, 96% customers Does_Churn when:"
## [1] "total_day_minutes < 160.2"
## [1] "total_eve_charge < 19.83"
## [1] "number_customer_service_calls > 3"
## [1] ""
## [1] ""
## [1] "Rule #5"
## [1] "In 43 cases, 95% customers Does_Churn when:"
## [1] "international_plan == no"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes > 246.60001"
## [1] "total_eve_charge > 20.5"
## [1] ""
## [1] ""
## [1] "Rule #6"
## [1] "In 28 cases, 93% customers Does_Churn when:"
## [1] "total_day_minutes < 264.39999"
## [1] "total_eve_calls < 125"
## [1] "total_eve_charge < 12.05"
## [1] "number_customer_service_calls > 3"
## [1] ""
## [1] ""
## [1] "Rule #7"
## [1] "In 78 cases, 90% customers Does_Churn when:"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes > 223.2"
## [1] "total_eve_charge > 20.5"
## [1] "total_night_minutes > 174.2"
## [1] ""
## [1] ""
## [1] "Rule #8"
## [1] "In 114 cases, 79% customers Does_Churn when:"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes > 223.2"
## [1] "total_eve_charge > 20.5"
## [1] ""
## [1] ""
## [1] "Rule #9"
## [1] "In 152 cases, 62% customers Does_Churn when:"
## [1] "total_day_minutes > 223.2"
## [1] "total_eve_charge > 20.5"
## [1] ""
## [1] ""
## [1] "Rule #10"
## [1] "In 211 cases, 60% customers Does_Churn when:"
## [1] "total_day_minutes > 264.39999"
## [1] ""
## [1] ""
## [1] "Rule #12"
## [1] "In 768 cases, 97% customers Stays when:"
## [1] "international_plan == no"
## [1] "voice_mail_plan == yes"
## [1] "number_customer_service_calls < 3"
## [1] ""
## [1] ""
## [1] "Rule #11"
## [1] "In 2221 cases, 97% customers Stays when:"
## [1] "international_plan == no"
## [1] "total_day_minutes < 223.2"
## [1] "number_customer_service_calls < 3"
## [1] ""
## [1] ""
## [1] "Rule #13"
## [1] "In 140 cases, 96% customers Stays when:"
## [1] "account_length < 123"
## [1] "total_eve_minutes < 187.7"
## [1] "total_night_minutes < 151.89999"
## [1] ""
## [1] ""
## [1] "Rule #14"
## [1] "In 45 cases, 98% customers Stays when:"
## [1] "international_plan == no"
## [1] "voice_mail_plan == yes"
## [1] "total_day_minutes > 264.39999"
## [1] ""
## [1] ""
## [1] "Rule #15"
## [1] "In 1972 cases, 96% customers Stays when:"
## [1] "total_day_minutes < 264.39999"
## [1] "total_intl_minutes < 13.1"
## [1] "total_intl_calls > 2"
## [1] "number_customer_service_calls < 3"
## [1] ""
## [1] ""
## [1] "Rule #16"
## [1] "In 197 cases, 95% customers Stays when:"
## [1] "total_day_minutes > 120.5"
## [1] "total_day_minutes < 160.2"
## [1] "total_eve_charge > 19.83"
## [1] ""
## [1] ""
## [1] "Rule #17"
## [1] "In 155 cases, 94% customers Stays when:"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes < 277"
## [1] "total_night_minutes < 126.9"
## [1] ""
## [1] ""
## [1] "Rule #18"
## [1] "In 1675 cases, 89% customers Stays when:"
## [1] "total_day_minutes > 160.2"
## [1] "total_day_minutes < 264.39999"
## [1] "total_eve_charge > 12.05"
## [1] ""
## [1] ""
## [1] "Rule #19"
## [1] "In 434 cases, 89% customers Stays when:"
## [1] "total_eve_charge < 12.26"
## [1] ""
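Though the output is somewhat customized compared to the original C5.0 output, the beauty here is that you now hold all the rules in a structured format and can customize to your heart's content! One caveat when reading the report: the raw rule format writes < where summary() prints <=, so number_customer_service_calls < 3 above means 3 calls or fewer.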
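As one example of such customization, the numeric conditions can be worded in plain English. The sketch below uses a hypothetical helper, describe_condition, and treats the raw "<" as "at most", since the raw rule format writes result=< for a test that summary() prints as <= (compare total_intl_calls in Rule 1).

```r
# Hypothetical helper: word a numeric condition in plain English.
# operator is the raw symbol from the rule format, where "<" stands for "<=".
describe_condition <- function(attribute, operator, threshold) {
  phrase <- switch(operator,
                   "<" = "is at most",
                   ">" = "is greater than",
                   stop("unexpected operator: ", operator))
  paste(gsub("_", " ", attribute, fixed = TRUE), phrase, threshold)
}

print(describe_condition("number_customer_service_calls", "<", "3"))
# "number customer service calls is at most 3"
print(describe_condition("total_day_minutes", ">", "223.2"))
# "total day minutes is greater than 223.2"
```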
Thanks again for the artwork, Lucas!!
Manuel Amunategui - Follow me on Twitter: @amunategui