About Manuel Amunategui

Data scientist with over 20 years of experience in the tech industry, MAs in Predictive Analytics and International Administration, co-author of Monetizing Machine Learning and VP of Data Science at SpringML.

From consulting in machine learning, healthcare modeling, 6 years on Wall Street in the financial industry, and 4 years at Microsoft, I feel like I’ve seen it all. And this has opened my eyes to the huge gap in educational material on applied data science. Like I say:

It just ain’t real 'til it reaches your customer’s plate

I am a startup advisor and available for speaking engagements with companies and schools on topics around building and motivating data science teams, and all things applied machine learning.

Reach me at amunategui@gmail.com

Data Exploration & Machine Learning, Hands-on


Where Are Your Customers Coming From And Where Are They Going - Reporting On Complex Customer Behavior In Plain English With C5.0

Practical walkthroughs on machine learning, data exploration and finding insight.


Resources

YouTube Companion Video

At SpringML, we are known for our customer-facing, colorful, interactive and informative dashboards, and we are very proud of them. As you can imagine, it takes time to build those dashboards, but it also requires a lot of research to decide what to feed into them - what will be the most empowering and pertinent to the customer. In that spirit, we are constantly evaluating new techniques and new tools to better dive into the data and unearth more patterns and bottlenecks.

The C5.0 model (C5.0 Decision Trees and Rule-Based Models) is one of those tools. It is a high-performing, tree-based classification model. Like other decision-tree learners, it repeatedly splits the data on the feature and level that best separate a particular outcome. C5.0 can output a set of non-linear rules describing which features, and at what levels, provide the most lift to the model. It doesn’t even require dummy variables for your categorical data. On the contrary, the closer your features are to spoken language, the more readable the output. With some prep work, we can automate the reporting of C5.0 results using a structure similar to spoken language, something that anybody can understand - no statistics degree required!

Why Are Your Customers Churning?

We’ll use the customer churn data set supplied with the C5.0 R library (loaded with data(churn)). As the name implies, the data contains customer information and usage records from a phone company, including whether the customer churned or not. Here is a brief sample of the distilled intelligence you can expect from C5.0. Though this is a toy data set, it would seem that heavy use throughout the day, international plans, and customer service calls are all patterns for churn:

Out of 489 cases, 82% churned when:

total_night_minutes > 174

total_eve_minutes > 241

total_day_minutes > 224

Out of 416 cases, 61% churned when:

international_plan == yes

Out of 348 cases, 70% churned when:

total_day_charge < 38.25

number_customer_service_calls > 3

This data set contains two data frames, a training set (3333 entries) and a testing set (1667 entries). We’ll use churnTrain to keep things simple and recode the outcome variable churn to be more readable:


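# install.packages('C50')
# note: newer C50 releases may no longer ship the churn data; if data(churn)
# fails, look for mlc_churn in the modeldata package (check your version)
library(C50)
data(churn)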

churn_data <- churnTrain
outcome_name <- 'churn'

# make the outcome variable easier to read
churn_data[,outcome_name] <- as.factor(ifelse(churn_data[,outcome_name]=='yes','Does_Churn', 'Stays'))

str(churn_data)

## 'data.frame':    3333 obs. of  20 variables:
##  $ state                        : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
##  $ account_length               : int  128 107 137 84 75 118 121 147 117 141 ...
##  $ area_code                    : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
##  $ international_plan           : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
##  $ voice_mail_plan              : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
##  $ number_vmail_messages        : int  25 26 0 0 0 0 24 0 0 37 ...
##  $ total_day_minutes            : num  265 162 243 299 167 ...
##  $ total_day_calls              : int  110 123 114 71 113 98 88 79 97 84 ...
##  $ total_day_charge             : num  45.1 27.5 41.4 50.9 28.3 ...
##  $ total_eve_minutes            : num  197.4 195.5 121.2 61.9 148.3 ...
##  $ total_eve_calls              : int  99 103 110 88 122 101 108 94 80 111 ...
##  $ total_eve_charge             : num  16.78 16.62 10.3 5.26 12.61 ...
##  $ total_night_minutes          : num  245 254 163 197 187 ...
##  $ total_night_calls            : int  91 103 104 89 121 118 118 96 90 97 ...
##  $ total_night_charge           : num  11.01 11.45 7.32 8.86 8.41 ...
##  $ total_intl_minutes           : num  10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
##  $ total_intl_calls             : int  3 3 5 7 3 6 7 6 4 5 ...
##  $ total_intl_charge            : num  2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
##  $ number_customer_service_calls: int  1 1 0 2 3 0 3 0 1 0 ...
##  $ churn                        : Factor w/ 2 levels "Does_Churn","Stays": 2 2 2 2 2 2 2 2 2 2 ...


Straight up C5.0

Let’s first look at the default summary output from C5.0:

c50_model <- C5.0(x=churn_data[,setdiff(names(churn_data), outcome_name)], y=churn_data[,outcome_name])
summary(c50_model)

## 
## Call:
## C5.0.default(x = churn_data[, setdiff(names(churn_data), outcome_name)],
##  y = churn_data[, outcome_name], rules = FALSE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Dec 27 10:01:02 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 3333 cases (20 attributes) from undefined.data
## 
## Decision tree:
## 
## total_day_minutes > 264.4:
## :...voice_mail_plan = yes:
## :   :...international_plan = no: Stays (45/1)
## :   :   international_plan = yes: Does_Churn (8/3)
## :   voice_mail_plan = no:
## :   :...total_eve_minutes > 187.7:
## :       :...total_night_minutes > 126.9: Does_Churn (94/1)
## :       :   total_night_minutes <= 126.9:
## :       :   :...total_day_minutes <= 277: Stays (4)
## :       :       total_day_minutes > 277: Does_Churn (3)
## :       total_eve_minutes <= 187.7:
## :       :...total_eve_charge <= 12.26: Stays (15/1)
## :           total_eve_charge > 12.26:
## :           :...total_day_minutes <= 277:
## :               :...total_night_minutes <= 224.8: Stays (13)
## :               :   total_night_minutes > 224.8: Does_Churn (5/1)
## :               total_day_minutes > 277:
## :               :...total_night_minutes > 151.9: Does_Churn (18)
## :                   total_night_minutes <= 151.9:
## :                   :...account_length <= 123: Stays (4)
## :                       account_length > 123: Does_Churn (2)
## total_day_minutes <= 264.4:
## :...number_customer_service_calls > 3:
##     :...total_day_minutes <= 160.2:
##     :   :...total_eve_charge <= 19.83: Does_Churn (79/3)
##     :   :   total_eve_charge > 19.83:
##     :   :   :...total_day_minutes <= 120.5: Does_Churn (10)
##     :   :       total_day_minutes > 120.5: Stays (13/3)
##     :   total_day_minutes > 160.2:
##     :   :...total_eve_charge > 12.05: Stays (130/24)
##     :       total_eve_charge <= 12.05:
##     :       :...total_eve_calls <= 125: Does_Churn (16/2)
##     :           total_eve_calls > 125: Stays (3)
##     number_customer_service_calls <= 3:
##     :...international_plan = yes:
##         :...total_intl_calls <= 2: Does_Churn (51)
##         :   total_intl_calls > 2:
##         :   :...total_intl_minutes <= 13.1: Stays (173/7)
##         :       total_intl_minutes > 13.1: Does_Churn (43)
##         international_plan = no:
##         :...total_day_minutes <= 223.2: Stays (2221/60)
##             total_day_minutes > 223.2:
##             :...total_eve_charge <= 20.5: Stays (295/22)
##                 total_eve_charge > 20.5:
##                 :...voice_mail_plan = yes: Stays (20)
##                     voice_mail_plan = no:
##                     :...total_night_minutes > 174.2: Does_Churn (50/8)
##                         total_night_minutes <= 174.2:
##                         :...total_day_minutes <= 246.6: Stays (12)
##                             total_day_minutes > 246.6:
##                             :...total_day_charge <= 43.33: Does_Churn (4)
##                                 total_day_charge > 43.33: Stays (2)
## 
## 
## Evaluation on training data (3333 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      27  136( 4.1%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     365   118    (a): class Does_Churn
##      18  2832    (b): class Stays
## 
## 
##  Attribute usage:
## 
##  100.00% total_day_minutes
##   93.67% number_customer_service_calls
##   87.73% international_plan
##   20.73% total_eve_charge
##    8.97% voice_mail_plan
##    8.01% total_intl_calls
##    6.48% total_intl_minutes
##    6.33% total_night_minutes
##    4.74% total_eve_minutes
##    0.57% total_eve_calls
##    0.18% account_length
##    0.18% total_day_charge
## 
## 
## Time: 0.0 secs

Note from C5.0: An Informal Tutorial: "For instance, the last leaf of the decision tree is compensated (174.6/24.8), for which n is 174.6 and m is 24.8. The value of n is the number of cases in the file hypothyroid.data that are mapped to this leaf, and m (if it appears) is the number of them that are classified incorrectly by the leaf. (A non-integral number of cases can arise because, when the value of an attribute in the tree is not known, C5.0 splits the case and sends a fraction down each branch.)"
C5.0: An Informal Tutorial
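
To make the (n/m) notation concrete, here is a minimal base R sketch that pulls n and m out of a few leaf labels copied from the tree above and computes each leaf's training error rate. The helper name leaf_counts is my own for illustration and is not part of the C50 package.

```r
# Hypothetical helper (not part of C50): read "(n/m)" or "(n)" from leaf labels
leaf_counts <- function(leaves) {
  # keep only the text between the last pair of parentheses
  inside <- sub(".*\\((.*)\\)$", "\\1", leaves)
  parts  <- strsplit(inside, "/", fixed = TRUE)
  n <- as.numeric(vapply(parts, `[`, character(1), 1))
  # m is omitted when the leaf makes no mistakes, so default it to 0
  m <- vapply(parts, function(p) if (length(p) > 1) as.numeric(p[2]) else 0, numeric(1))
  data.frame(leaf = leaves, n = n, m = m, error_rate = round(m / n, 3))
}

leaf_counts(c("Stays (45/1)", "Does_Churn (8/3)", "Stays (4)", "Stays (130/24)"))
```

Summing m over all 27 leaves of the tree above gives the 136 training errors reported in the evaluation table.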

Now, let’s set rules to TRUE (see ?C5.0: "A logical: should the tree be decomposed into a rule-based model?"). The tree is decomposed into a set of rules, a more readable format that is going to help us get this information into a structured form:

c50_model <- C5.0(x=churn_data[,setdiff(names(churn_data), outcome_name)], y=churn_data[,outcome_name], rules = TRUE)
summary(c50_model)

## 
## Call:
## C5.0.default(x = churn_data[, setdiff(names(churn_data), outcome_name)],
##  y = churn_data[, outcome_name], rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Dec 27 10:01:02 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 3333 cases (20 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (60, lift 6.8)
##  international_plan = yes
##  total_intl_calls <= 2
##  ->  class Does_Churn  [0.984]
## 
## Rule 2: (57, lift 6.8)
##  international_plan = yes
##  total_intl_minutes > 13.1
##  ->  class Does_Churn  [0.983]
## 
## Rule 3: (32, lift 6.7)
##  total_day_minutes <= 120.5
##  number_customer_service_calls > 3
##  ->  class Does_Churn  [0.971]
## 
## Rule 4: (79/3, lift 6.6)
##  total_day_minutes <= 160.2
##  total_eve_charge <= 19.83
##  number_customer_service_calls > 3
##  ->  class Does_Churn  [0.951]
## 
## Rule 5: (43/2, lift 6.4)
##  international_plan = no
##  voice_mail_plan = no
##  total_day_minutes > 246.6
##  total_eve_charge > 20.5
##  ->  class Does_Churn  [0.933]
## 
## Rule 6: (28/2, lift 6.2)
##  total_day_minutes <= 264.4
##  total_eve_calls <= 125
##  total_eve_charge <= 12.05
##  number_customer_service_calls > 3
##  ->  class Does_Churn  [0.900]
## 
## Rule 7: (78/8, lift 6.1)
##  voice_mail_plan = no
##  total_day_minutes > 223.2
##  total_eve_charge > 20.5
##  total_night_minutes > 174.2
##  ->  class Does_Churn  [0.888]
## 
## Rule 8: (114/24, lift 5.4)
##  voice_mail_plan = no
##  total_day_minutes > 223.2
##  total_eve_charge > 20.5
##  ->  class Does_Churn  [0.784]
## 
## Rule 9: (152/58, lift 4.3)
##  total_day_minutes > 223.2
##  total_eve_charge > 20.5
##  ->  class Does_Churn  [0.617]
## 
## Rule 10: (211/84, lift 4.1)
##  total_day_minutes > 264.4
##  ->  class Does_Churn  [0.601]
## 
## Rule 11: (2221/60, lift 1.1)
##  international_plan = no
##  total_day_minutes <= 223.2
##  number_customer_service_calls <= 3
##  ->  class Stays  [0.973]
## 
## Rule 12: (768/20, lift 1.1)
##  international_plan = no
##  voice_mail_plan = yes
##  number_customer_service_calls <= 3
##  ->  class Stays  [0.973]
## 
## Rule 13: (140/5, lift 1.1)
##  account_length <= 123
##  total_eve_minutes <= 187.7
##  total_night_minutes <= 151.9
##  ->  class Stays  [0.958]
## 
## Rule 14: (45/1, lift 1.1)
##  international_plan = no
##  voice_mail_plan = yes
##  total_day_minutes > 264.4
##  ->  class Stays  [0.957]
## 
## Rule 15: (1972/87, lift 1.1)
##  total_day_minutes <= 264.4
##  total_intl_minutes <= 13.1
##  total_intl_calls > 2
##  number_customer_service_calls <= 3
##  ->  class Stays  [0.955]
## 
## Rule 16: (197/9, lift 1.1)
##  total_day_minutes > 120.5
##  total_day_minutes <= 160.2
##  total_eve_charge > 19.83
##  ->  class Stays  [0.950]
## 
## Rule 17: (155/10, lift 1.1)
##  voice_mail_plan = no
##  total_day_minutes <= 277
##  total_night_minutes <= 126.9
##  ->  class Stays  [0.930]
## 
## Rule 18: (1675/185, lift 1.0)
##  total_day_minutes > 160.2
##  total_day_minutes <= 264.4
##  total_eve_charge > 12.05
##  ->  class Stays  [0.889]
## 
## Rule 19: (434/49, lift 1.0)
##  total_eve_charge <= 12.26
##  ->  class Stays  [0.885]
## 
## Default class: Stays
## 
## 
## Evaluation on training data (3333 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##      19  146( 4.4%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     371   112    (a): class Does_Churn
##      34  2816    (b): class Stays
## 
## 
##  Attribute usage:
## 
##   98.23% total_day_minutes
##   84.61% number_customer_service_calls
##   75.73% international_plan
##   71.83% total_eve_charge
##   60.97% total_intl_calls
##   60.88% total_intl_minutes
##   31.02% voice_mail_plan
##   10.11% total_night_minutes
##    4.20% account_length
##    4.20% total_eve_minutes
##    0.84% total_eve_calls
## 
## 
## Time: 0.0 secs

Let’s analyze one of the rules:

Rule x: (152/58, lift 4.3)
    total_day_minutes > 223.2
    total_eve_charge > 20.5
    ->  class Does_Churn  [0.617]

It states that 152 cases met both conditions, total_day_minutes > 223.2 and total_eve_charge > 20.5. Of those 152 cases, 58 did not churn, so 94 led to the stated outcome churn==Does_Churn. The value in brackets, 0.617, is the rule's estimated accuracy, and the rule provides a lift of 4.3 in predicting churn.
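
The informal tutorial cited earlier describes the bracketed value as a Laplace ratio, (n - m + 1) / (n + 2), and the lift as that estimate divided by the relative frequency of the predicted class in the training set. As a rough check, the sketch below reproduces both numbers for this rule from the counts printed above. The variable names are mine.

```r
n <- 152   # training cases covered by the rule
m <- 58    # covered cases that do not belong to the predicted class

churners <- 365 + 118   # Does_Churn row of the confusion matrix above
total    <- 3333        # training cases

estimated_accuracy <- (n - m + 1) / (n + 2)   # Laplace ratio
class_frequency    <- churners / total        # share of Does_Churn in the data
lift               <- estimated_accuracy / class_frequency

round(estimated_accuracy, 3)   # 0.617, the value in brackets
round(lift, 1)                 # 4.3, the lift in the rule header
```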

Automating a Report

Though the output is interesting and insightful, we need this in a data-frame format so we can build it into our reporting pipeline. Let’s see what we have to work with:

head(c50_model$rules)

## [1] "id=\"See5/C5.0 2.07 GPL Edition 2016-12-27\"\nentries=\"1\"\nrules=\"19\" default=\"Stays\"\nconds=\"2\" cover=\"60\" ok=\"60\" lift=\"6.78932\" class=\"Does_Churn\"\ntype=\"1\" att=\"international_plan\" val=\"yes\"\ntype=\"2\" att=\"total_intl_calls\" cut=\"2\" result=\"<\"\nconds=\"2\" cover=\"57\" ok=\"57\" lift=\"6.78366\" class=\"Does_Churn\"\ntype=\"1\" att=\"international_plan\" val=\"yes\"\ntype=\"2\" att=\"total_intl_minutes\" cut=\"13.1\" result=\">\"\nconds=\"2\" cover=\"32\" ok=\"32\" lift=\"6.69766\" class=\"Does_Churn\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"120.5\" result=\"<\"\ntype=\"2\" att=\"number_customer_service_calls\" cut=\"3\" result=\">\"\nconds=\"3\" cover=\"79\" ok=\"76\" lift=\"6.55985\" class=\"Does_Churn\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"160.2\" result=\"<\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"19.83\" result=\"<\"\ntype=\"2\" att=\"number_customer_service_calls\" cut=\"3\" result=\">\"\nconds=\"4\" cover=\"43\" ok=\"41\" lift=\"6.44058\" class=\"Does_Churn\"\ntype=\"1\" att=\"international_plan\" val=\"no\"\ntype=\"1\" att=\"voice_mail_plan\" val=\"no\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"246.60001\" result=\">\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"20.5\" result=\">\"\nconds=\"4\" cover=\"28\" ok=\"26\" lift=\"6.21056\" class=\"Does_Churn\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"264.39999\" result=\"<\"\ntype=\"2\" att=\"total_eve_calls\" cut=\"125\" result=\"<\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"12.05\" result=\"<\"\ntype=\"2\" att=\"number_customer_service_calls\" cut=\"3\" result=\">\"\nconds=\"4\" cover=\"78\" ok=\"70\" lift=\"6.1243\" class=\"Does_Churn\"\ntype=\"1\" att=\"voice_mail_plan\" val=\"no\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"223.2\" result=\">\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"20.5\" result=\">\"\ntype=\"2\" att=\"total_night_minutes\" cut=\"174.2\" result=\">\"\nconds=\"3\" cover=\"114\" ok=\"90\" lift=\"5.41342\" 
class=\"Does_Churn\"\ntype=\"1\" att=\"voice_mail_plan\" val=\"no\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"223.2\" result=\">\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"20.5\" result=\">\"\nconds=\"2\" cover=\"152\" ok=\"94\" lift=\"4.25688\" class=\"Does_Churn\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"223.2\" result=\">\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"20.5\" result=\">\"\nconds=\"1\" cover=\"211\" ok=\"127\" lift=\"4.14685\" class=\"Does_Churn\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"264.39999\" result=\">\"\nconds=\"3\" cover=\"2221\" ok=\"2161\" lift=\"1.13738\" class=\"Stays\"\ntype=\"1\" att=\"international_plan\" val=\"no\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"223.2\" result=\"<\"\ntype=\"2\" att=\"number_customer_service_calls\" cut=\"3\" result=\"<\"\nconds=\"3\" cover=\"768\" ok=\"748\" lift=\"1.13758\" class=\"Stays\"\ntype=\"1\" att=\"international_plan\" val=\"no\"\ntype=\"1\" att=\"voice_mail_plan\" val=\"yes\"\ntype=\"2\" att=\"number_customer_service_calls\" cut=\"3\" result=\"<\"\nconds=\"3\" cover=\"140\" ok=\"135\" lift=\"1.12006\" class=\"Stays\"\ntype=\"2\" att=\"account_length\" cut=\"123\" result=\"<\"\ntype=\"2\" att=\"total_eve_minutes\" cut=\"187.7\" result=\"<\"\ntype=\"2\" att=\"total_night_minutes\" cut=\"151.89999\" result=\"<\"\nconds=\"3\" cover=\"45\" ok=\"44\" lift=\"1.11971\" class=\"Stays\"\ntype=\"1\" att=\"international_plan\" val=\"no\"\ntype=\"1\" att=\"voice_mail_plan\" val=\"yes\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"264.39999\" result=\">\"\nconds=\"4\" cover=\"1972\" ok=\"1885\" lift=\"1.11734\" class=\"Stays\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"264.39999\" result=\"<\"\ntype=\"2\" att=\"total_intl_minutes\" cut=\"13.1\" result=\"<\"\ntype=\"2\" att=\"total_intl_calls\" cut=\"2\" result=\">\"\ntype=\"2\" att=\"number_customer_service_calls\" cut=\"3\" result=\"<\"\nconds=\"3\" cover=\"197\" ok=\"188\" lift=\"1.11071\" class=\"Stays\"\ntype=\"2\" att=\"total_day_minutes\" 
cut=\"120.5\" result=\">\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"160.2\" result=\"<\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"19.83\" result=\">\"\nconds=\"3\" cover=\"155\" ok=\"145\" lift=\"1.08754\" class=\"Stays\"\ntype=\"1\" att=\"voice_mail_plan\" val=\"no\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"277\" result=\"<\"\ntype=\"2\" att=\"total_night_minutes\" cut=\"126.9\" result=\"<\"\nconds=\"3\" cover=\"1675\" ok=\"1490\" lift=\"1.03976\" class=\"Stays\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"160.2\" result=\">\"\ntype=\"2\" att=\"total_day_minutes\" cut=\"264.39999\" result=\"<\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"12.05\" result=\">\"\nconds=\"1\" cover=\"434\" ok=\"385\" lift=\"1.03536\" class=\"Stays\"\ntype=\"2\" att=\"total_eve_charge\" cut=\"12.26\" result=\"<\"\n"

Yes, it’s a mess! But don’t despair, we’ll strsplit our way out of this and make it all come together:


# split the raw rule string on newlines, strip the quotes and drop the leading id line
rule_munger <- strsplit(c50_model$rules, '\n', fixed = TRUE)[[1]]
rule_munger <- gsub(x = rule_munger, pattern = '"', replacement = '', fixed = TRUE)[-1]
head(rule_munger,6)

## [1] "entries=1"                                           
## [2] "rules=19 default=Stays"                              
## [3] "conds=2 cover=60 ok=60 lift=6.78932 class=Does_Churn"
## [4] "type=1 att=international_plan val=yes"               
## [5] "type=2 att=total_intl_calls cut=2 result=<"          
## [6] "conds=2 cover=57 ok=57 lift=6.78366 class=Does_Churn"

Much better, right?

The format is a bit different, so let’s walk through it. C5.0 found 19 rules in total. The first rule has 2 conditions, a cover of 60 (60 rows matched, all 60 correctly, so a 100% success rate) and a lift of 6.78932 for predicting churn==Does_Churn. The type field identifies the kind of test: in this output, type=1 is a categorical test that carries a val, and type=2 is a numeric threshold test that carries a cut and a result. The first condition is international_plan == yes and the second is total_intl_calls <= 2. Note that the raw format writes result=< where the summary prints <=. There you have it, a rule that is practically readable.
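
Each cleaned line is just a run of key=value pairs separated by spaces, so it can be read generically. Here is a minimal sketch, assuming no value contains a space. The helper name parse_entry is hypothetical and not part of C50.

```r
# Hypothetical helper: turn "key=value key=value ..." into a named character vector
parse_entry <- function(entry) {
  pairs  <- strsplit(strsplit(entry, " ", fixed = TRUE)[[1]], "=", fixed = TRUE)
  keys   <- vapply(pairs, `[`, character(1), 1)
  # re-join with "=" in case a value itself contained an equals sign
  values <- vapply(pairs, function(p) paste(p[-1], collapse = "="), character(1))
  stats::setNames(values, keys)
}

header    <- parse_entry("conds=2 cover=60 ok=60 lift=6.78932 class=Does_Churn")
condition <- parse_entry("type=2 att=total_intl_calls cut=2 result=<")

header[["cover"]]       # "60"
condition[["result"]]   # "<"
```

Looking fields up by name rather than by position keeps the parsing readable, at the cost of assuming the key names stay stable across C5.0 releases.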

Reporting Magic

To make this easier on all of us, we’re going to build a function called interesting_interactions that will format this automatically and return a compact data frame.




interesting_interactions <- function(the_data_frame, outcome_name) {
  # install.packages(...) if missing
  require(C50)
  require(dplyr)
  
  c5model <- C5.0(
    x = the_data_frame[,setdiff(names(the_data_frame), outcome_name)],
    y = the_data_frame[,outcome_name],
    rules = TRUE
  )
  
  # split the raw rule string on newlines, strip the quotes and drop the leading id line
  rule_munger <- strsplit(c5model$rules, '\n', fixed = TRUE)[[1]]
  rule_munger <- gsub(x = rule_munger, pattern = '"', replacement = '', fixed = TRUE)[-1]
  
  # extract results into data frame format
  rule_count <- 0
  conds_last <- 0
  cover_last <- 0
  ok_last <- 0
  lift_last <-  0
  class_last <- 0
  
  rules <- c()
  for (entry in rule_munger) {
    # track only lines starting with conds or type - ignore rest
    if (substr(entry,1,5) == 'conds' ||
        substr(entry,1,4) == 'type') {
      if (substr(entry,1,5) == 'conds') {
        rule_count <- rule_count + 1
        conds_last <-
          strsplit(x = strsplit(x = entry, split = " ")[[1]][1], split = '=')[[1]][2]
        # cover is the number of training cases covered by the rule
        cover_last <-
          strsplit(x = strsplit(x = entry, split = " ")[[1]][2], split = '=')[[1]][2]
        # ok is the number of covered cases that belong to the predicted class
        ok_last <-
          strsplit(x = strsplit(x = entry, split = " ")[[1]][3], split = '=')[[1]][2]
        # lift is the rule's estimated accuracy divided by the relative
        # frequency of the predicted class in the training set
        lift_last <-
          strsplit(x = strsplit(x = entry, split = " ")[[1]][4], split = '=')[[1]][2]
        # class predicted by the rule
        class_last <-
          strsplit(x = strsplit(x = entry, split = " ")[[1]][5], split = '=')[[1]][2]
      }
      
      if (substr(entry,1,4) == 'type') {
        # variable type
        type_last <-
          strsplit(x = strsplit(x = entry, split = " ")[[1]][1], split = '=')[[1]][2]
        att_last <-
          strsplit(x = strsplit(x = entry, split = " ")[[1]][2], split = '=')[[1]][2]
        
        # sniff out optional parameters
        elts_last <- ''
        if (grepl(x = entry, pattern = 'elts')) {
          elts_last <- strsplit(x = entry, split = "elts=")[[1]][2]
        }
        
        cut_last <- ''
        if (grepl(x = entry, pattern = 'cut')) {
          cut_last <-
            strsplit(
              x = strsplit(
                x = entry, split = "cut="
              )[[1]][2], split = ' '
            )[[1]][1]
        }
        
        val_last <- ''
        if (grepl(x = entry, pattern = 'val')) {
          val_last <- strsplit(x = entry, split = "val=")[[1]][2]
        }
        
        result_last <- ''
        if (grepl(x = entry, pattern = 'result')) {
          result_last <- strsplit(x = entry, split = "result=")[[1]][2]
        }
        
        rules <- rbind(
          rules, c(
            rule_count,
            conds_last,
            cover_last,
            ok_last,
            lift_last,
            type_last,
            att_last,
            elts_last,
            result_last,
            cut_last,
            val_last,
            class_last
          )
        )
      }
      
    }
    
  }
  if (!is.null(rules)) {
    
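    rules <- data.frame(rules)
    
    # note: the values were bound above in the order elts, result, cut, val, so
    # the column named 'cut' holds the comparison operator and 'result' the threshold
    names(rules) <-
      c(
        'rule_number', 'conditions', 'cover', 'true_pos',
        'lift', 'type', 'attribute', 'elts', 'cut', 'result',
        'value', 'outcome'
      )
    # convert the first six columns from text to numbers
    rules[, 1:6] <- lapply(rules[, 1:6], function(x) as.numeric(as.character(x)))
    
    if (length(unique(rules$rule_number)) > 0) {
      rules %>% dplyr::arrange(desc(lift)) -> rules
    }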
    
  }
  return (rules)
}

Now that we have built this function, let’s take it for a spin and see what we can learn about our churning and non-churning customers.

results <- interesting_interactions(the_data_frame = churn_data, outcome_name = outcome_name)

print(head(results, 10))

##    rule_number conditions cover true_pos    lift type
## 1            1          2    60       60 6.78932    1
## 2            1          2    60       60 6.78932    2
## 3            2          2    57       57 6.78366    1
## 4            2          2    57       57 6.78366    2
## 5            3          2    32       32 6.69766    2
## 6            3          2    32       32 6.69766    2
## 7            4          3    79       76 6.55985    2
## 8            4          3    79       76 6.55985    2
## 9            4          3    79       76 6.55985    2
## 10           5          4    43       41 6.44058    1
##                        attribute elts cut result value    outcome
## 1             international_plan                   yes Does_Churn
## 2               total_intl_calls        <      2       Does_Churn
## 3             international_plan                   yes Does_Churn
## 4             total_intl_minutes        >   13.1       Does_Churn
## 5              total_day_minutes        <  120.5       Does_Churn
## 6  number_customer_service_calls        >      3       Does_Churn
## 7              total_day_minutes        <  160.2       Does_Churn
## 8               total_eve_charge        <  19.83       Does_Churn
## 9  number_customer_service_calls        >      3       Does_Churn
## 10            international_plan                    no Does_Churn
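
With one row per condition, it is easy to collapse each rule back into a single line. The sketch below does this in base R on four rows copied by hand from the printout above. The names rows and one_line are mine, and the 'cut' and 'result' columns are used as they appear in the printout (operator and threshold respectively).

```r
# A few rows copied by hand from the printout above (columns as named there)
rows <- data.frame(
  rule_number = c(1, 1, 2, 2),
  attribute   = c("international_plan", "total_intl_calls",
                  "international_plan", "total_intl_minutes"),
  cut         = c("", "<", "", ">"),      # holds the comparison operator
  result      = c("", "2", "", "13.1"),   # holds the threshold
  value       = c("yes", "", "yes", ""),
  stringsAsFactors = FALSE
)

# categorical tests read "attribute == value", numeric tests "attribute op threshold"
rows$condition <- ifelse(
  nzchar(rows$value),
  paste(rows$attribute, "==", rows$value),
  paste(rows$attribute, rows$cut, rows$result)
)

# one line per rule, conditions joined with AND
one_line <- aggregate(condition ~ rule_number, data = rows,
                      FUN = paste, collapse = " AND ")
one_line
```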

We can take this one step further and build a report-generating function. Something that can peel the conditions apart and weave them into a sentence:


# rules report
print_rules <- function(rules_found, rulenum) {
  print('')
  print(paste0('Rule #', rulenum))
  dplyr::filter(rules_found, rule_number == rulenum) -> pulled_rule
  dplyr::select(pulled_rule, cover, true_pos, outcome) %>% head(1) -> rule_def
  dplyr::select(pulled_rule, attribute, elts, cut, result, value) -> conditions
  
  print(paste0('In ', rule_def$cover, ' cases, ', round(rule_def$true_pos/rule_def$cover,2)*100, '% customers ',
               as.character(rule_def$outcome),' when:'))
  
  for (cond_id in seq_len(nrow(conditions))) {
    cond <- conditions[cond_id,]
    #attribute elts cut result value
    if (nchar(as.character(cond$elts)) > 0) {
      print(paste0(cond$attribute, 
                   ': ', cond$elts))
    } else if (nchar(as.character(cond$value)) > 0) {
      print(paste0(cond$attribute, 
                   ' == ', cond$value))
    } else {
      print(paste0(cond$attribute, " ", cond$cut, " ", cond$result))
    }
  }
  print('')
}


# collate interesting rules into report format - top x rules found sorted by lift
for (rule_number in unique(results$rule_number))
  print_rules(results, rule_number)

## [1] ""
## [1] "Rule #1"
## [1] "In 60 cases, 100% customers Does_Churn when:"
## [1] "international_plan == yes"
## [1] "total_intl_calls < 2"
## [1] ""
## [1] ""
## [1] "Rule #2"
## [1] "In 57 cases, 100% customers Does_Churn when:"
## [1] "international_plan == yes"
## [1] "total_intl_minutes > 13.1"
## [1] ""
## [1] ""
## [1] "Rule #3"
## [1] "In 32 cases, 100% customers Does_Churn when:"
## [1] "total_day_minutes < 120.5"
## [1] "number_customer_service_calls > 3"
## [1] ""
## [1] ""
## [1] "Rule #4"
## [1] "In 79 cases, 96% customers Does_Churn when:"
## [1] "total_day_minutes < 160.2"
## [1] "total_eve_charge < 19.83"
## [1] "number_customer_service_calls > 3"
## [1] ""
## [1] ""
## [1] "Rule #5"
## [1] "In 43 cases, 95% customers Does_Churn when:"
## [1] "international_plan == no"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes > 246.60001"
## [1] "total_eve_charge > 20.5"
## [1] ""
## [1] ""
## [1] "Rule #6"
## [1] "In 28 cases, 93% customers Does_Churn when:"
## [1] "total_day_minutes < 264.39999"
## [1] "total_eve_calls < 125"
## [1] "total_eve_charge < 12.05"
## [1] "number_customer_service_calls > 3"
## [1] ""
## [1] ""
## [1] "Rule #7"
## [1] "In 78 cases, 90% customers Does_Churn when:"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes > 223.2"
## [1] "total_eve_charge > 20.5"
## [1] "total_night_minutes > 174.2"
## [1] ""
## [1] ""
## [1] "Rule #8"
## [1] "In 114 cases, 79% customers Does_Churn when:"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes > 223.2"
## [1] "total_eve_charge > 20.5"
## [1] ""
## [1] ""
## [1] "Rule #9"
## [1] "In 152 cases, 62% customers Does_Churn when:"
## [1] "total_day_minutes > 223.2"
## [1] "total_eve_charge > 20.5"
## [1] ""
## [1] ""
## [1] "Rule #10"
## [1] "In 211 cases, 60% customers Does_Churn when:"
## [1] "total_day_minutes > 264.39999"
## [1] ""
## [1] ""
## [1] "Rule #12"
## [1] "In 768 cases, 97% customers Stays when:"
## [1] "international_plan == no"
## [1] "voice_mail_plan == yes"
## [1] "number_customer_service_calls < 3"
## [1] ""
## [1] ""
## [1] "Rule #11"
## [1] "In 2221 cases, 97% customers Stays when:"
## [1] "international_plan == no"
## [1] "total_day_minutes < 223.2"
## [1] "number_customer_service_calls < 3"
## [1] ""
## [1] ""
## [1] "Rule #13"
## [1] "In 140 cases, 96% customers Stays when:"
## [1] "account_length < 123"
## [1] "total_eve_minutes < 187.7"
## [1] "total_night_minutes < 151.89999"
## [1] ""
## [1] ""
## [1] "Rule #14"
## [1] "In 45 cases, 98% customers Stays when:"
## [1] "international_plan == no"
## [1] "voice_mail_plan == yes"
## [1] "total_day_minutes > 264.39999"
## [1] ""
## [1] ""
## [1] "Rule #15"
## [1] "In 1972 cases, 96% customers Stays when:"
## [1] "total_day_minutes < 264.39999"
## [1] "total_intl_minutes < 13.1"
## [1] "total_intl_calls > 2"
## [1] "number_customer_service_calls < 3"
## [1] ""
## [1] ""
## [1] "Rule #16"
## [1] "In 197 cases, 95% customers Stays when:"
## [1] "total_day_minutes > 120.5"
## [1] "total_day_minutes < 160.2"
## [1] "total_eve_charge > 19.83"
## [1] ""
## [1] ""
## [1] "Rule #17"
## [1] "In 155 cases, 94% customers Stays when:"
## [1] "voice_mail_plan == no"
## [1] "total_day_minutes < 277"
## [1] "total_night_minutes < 126.9"
## [1] ""
## [1] ""
## [1] "Rule #18"
## [1] "In 1675 cases, 89% customers Stays when:"
## [1] "total_day_minutes > 160.2"
## [1] "total_day_minutes < 264.39999"
## [1] "total_eve_charge > 12.05"
## [1] ""
## [1] ""
## [1] "Rule #19"
## [1] "In 434 cases, 89% customers Stays when:"
## [1] "total_eve_charge < 12.26"
## [1] ""

Though the output is somewhat customized compared to the original C5.0 output - the beauty here is that you now hold all the rules in a structured format and can customize to your heart's content!
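
One customization worth considering: the raw rules write result=< where the summary prints <= (compare total_intl_calls <= 2 in Rule 1 of the summary with "total_intl_calls < 2" in the report). Here is a minimal sketch of a wording helper that keeps that meaning. The function name and the phrasing are my own assumptions, not part of C50.

```r
# Hypothetical helper: phrase one condition in plain English.
# "<" in the raw rules corresponds to "<=" in the summary, hence "at most".
describe_condition <- function(attribute, operator = "", threshold = "", value = "") {
  if (nzchar(value)) {
    return(paste(attribute, "is", value))
  }
  readable <- c("<" = "is at most", ">" = "is more than")
  paste(attribute, readable[[operator]], threshold)
}

describe_condition("total_intl_calls", "<", "2")
describe_condition("total_intl_minutes", ">", "13.1")
describe_condition("international_plan", value = "yes")
```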

Thanks again for the artwork, Lucas!!

Manuel Amunategui - Follow me on Twitter: @amunategui