A common use case in most businesses is preventing churn. A common way of going about this is to compare the averages for those who churned vs. those who didn’t, find what stands out, and start to target that. This may be fine provided (A) the metrics/attributes you find have a clear distinction and compelling narrative to explain why, and (B) your overall set of things to look at is very small. A can happen, but B is rare.
Well, that’s a lot. We can see some patterns here, but since there can be any number of combinations of these values, we won’t know which ones really matter. Having dependents seems to prevent churn, but so do two-year plans. What if those are correlated, and most people who have two-year plans also have dependents? What then?
This is where we move from the aggregate to the individual. Aggregate data is fine, but it can also hide stories or, worse, lead us down the wrong path. Instead, we can turn to modeling our data.
We’ll start with a simple generalized linear model. This is a fancy way of saying “we’ll use a model to predict a binary outcome”. We’ll run this on everything in the dataset except customer ID, then look at our predictors. It gives us odds ratios - if a predictor’s value is 1.20, it increases the odds of churning by 20%, while a value of 0.80 means a 20% decrease. The CI (confidence interval) gives the expected range for the value given all we know; if it contains 1 (no difference), the predictor is likely not significant. A p-value below 0.05 means it likely is.
*Since each customer has their own outcome, we don’t want to include their ID in the model, or we’ll drastically overfit. If we know Bob churned, then once the model knows it’s Bob, it knows he churned, and our other variables won’t tell us anything.
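As a quick illustration of how to read odds ratios (using R’s built-in mtcars data rather than the telco set, so it runs anywhere), exponentiating a logistic model’s coefficients puts them on the odds-ratio scale:

```r
# Toy example, NOT the telco data: predict transmission type (am) from weight (wt)
m <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

exp(coef(m))    # odds ratios: > 1 raises the odds, < 1 lowers them
exp(confint(m)) # 95% CI on the odds-ratio scale; if it spans 1, likely not significant
```

Here the odds ratio for `wt` comes out below 1: heavier cars have lower odds of a manual transmission, the same way an odds ratio below 1 in our churn model means lower odds of churning.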
telco_lm <- glm(
  Churn ~ .,
  data = telco_munged,
  family = binomial(link = "logit")
)

tab_model(telco_lm)
Churn

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 3.21 | 0.65 – 15.86 | 0.153 |
| gender [Male] | 0.98 | 0.86 – 1.11 | 0.736 |
| SeniorCitizen | 1.24 | 1.05 – 1.47 | 0.010 |
| Partner [Yes] | 1.00 | 0.86 – 1.16 | 0.996 |
| Dependents [Yes] | 0.86 | 0.72 – 1.03 | 0.098 |
| tenure | 0.94 | 0.93 – 0.95 | <0.001 |
| PhoneService [Yes] | 1.19 | 0.33 – 4.24 | 0.792 |
| MultipleLines [Yes] | 1.57 | 1.11 – 2.22 | 0.011 |
| InternetService [Fiber optic] | 5.74 | 1.20 – 27.49 | 0.029 |
| InternetService [No] | 0.17 | 0.03 – 0.82 | 0.027 |
| OnlineSecurity [Yes] | 0.81 | 0.57 – 1.16 | 0.250 |
| OnlineBackup [Yes] | 1.03 | 0.73 – 1.45 | 0.882 |
| DeviceProtection [Yes] | 1.16 | 0.82 – 1.64 | 0.403 |
| TechSupport [Yes] | 0.83 | 0.59 – 1.19 | 0.318 |
| StreamingTV [Yes] | 1.80 | 0.95 – 3.42 | 0.070 |
| StreamingMovies [Yes] | 1.82 | 0.96 – 3.46 | 0.067 |
| Contract [One year] | 0.52 | 0.42 – 0.64 | <0.001 |
| Contract [Two year] | 0.26 | 0.18 – 0.36 | <0.001 |
| PaperlessBilling [Yes] | 1.41 | 1.22 – 1.63 | <0.001 |
| PaymentMethod [Credit card (automatic)] | 0.92 | 0.73 – 1.15 | 0.442 |
| PaymentMethod [Electronic check] | 1.36 | 1.13 – 1.63 | 0.001 |
| PaymentMethod [Mailed check] | 0.94 | 0.75 – 1.18 | 0.616 |
| MonthlyCharges | 0.96 | 0.90 – 1.02 | 0.204 |
| TotalCharges | 1.00 | 1.00 – 1.00 | <0.001 |
| Observations | 7032 | | |
| R² Tjur | 0.308 | | |
Already this is better. We see that Senior Citizens, Multiple Lines, Fiber Optic, Paperless Billing, and E-Check are all associated with higher churn (odds ratio > 1), while higher tenure, No Internet Service, One Year Contracts, Two Year Contracts, and Total Charges are associated with lower churn. Note that we’re saying associated - we’re not claiming a causal relationship, only that these tend to be present when customers churn.
This is generally going to take us far enough, but TotalCharges is goofy. It ranges from $18.80 to $8,648.80, which is a pretty wide range, especially alongside predictors that are mostly 0s and 1s. We’re also missing interactions - perhaps older customers are less price sensitive, for instance.
We can take it a step further by using XGBoost. Since it’s an ensemble tree model, it’ll capture some interactions for us. We’ll then extract the SHAP values, which tell us, per customer, what mattered most.
# we want "Yes" to be our main target, so we need to set our levels
# appropriately for XGBoost
telco_munged_td_pre <- telco_munged %>%
  mutate(Churn = fct_relevel(Churn, "Yes", "No"))

# we need dummy variables created now so we can model interactions later.
# If we were skipping that part, we'd just create a recipe and call it good.
telco_munged_td <- recipe(Churn ~ ., data = telco_munged_td_pre) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  prep() %>%
  bake(telco_munged_td_pre)

xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = 4,
  min_n = 10,
  loss_reduction = 0.01,
  sample_size = 0.8,
  mtry = 3,
  learn_rate = 0.01
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_recipe <- recipe(Churn ~ ., data = telco_munged_td)

xgb_wflow <- workflow() %>%
  add_model(xgb_spec) %>%
  add_recipe(xgb_recipe)

xgb_fit <- fit(xgb_wflow, data = telco_munged_td)

processed_data <- bake(prep(xgb_recipe), new_data = telco_munged_td, has_role("predictor")) %>%
  as.matrix()

xgb_engine <- extract_fit_engine(xgb_fit)

shap_values <- shap.prep(xgb_model = xgb_engine, X_train = processed_data)
shap.plot.summary(shap_values)
This shows us that high-tenure customers are less likely to churn, and contract type makes a difference: month-to-month customers are more likely to churn, while two-year customers are far less likely. Note this shows us all categorical values - in our linear model, we only saw one- and two-year contracts, not month-to-month. There are ways to see that in a linear model (sjPlot is great for this sort of thing if we plot our model), but tree models like XGBoost are generally better for seeing all levels.
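As a sketch of that sjPlot approach, assuming the telco_lm model fit earlier: plotting predicted probabilities shows every level of a term, including the reference level the coefficient table hides.

```r
# Sketch, assuming telco_lm from the GLM fit above
library(sjPlot)

# Predicted churn probability for each contract level, month-to-month included
plot_model(telco_lm, type = "pred", terms = "Contract")
```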
Finally, we can look at our interactions. We’ll use the hstats package to get variable importance. This will also bring back our global (non-interacting) values, which we don’t really need at this point, but it doesn’t hurt to look again, either.
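A minimal sketch of that hstats call, assuming the xgb_engine and processed_data objects from the XGBoost block above (xgboost’s predict method accepts a plain matrix, which is why we prepared one):

```r
# Sketch, assuming xgb_engine and processed_data from the XGBoost fit above
library(hstats)

# Friedman's H-statistics: how much of each pair's joint effect is interaction
interaction_stats <- hstats(xgb_engine, X = processed_data)

h2_pairwise(interaction_stats)  # pairwise interaction strengths
plot(interaction_stats)         # overall importance plus interactions
```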
Remember how we were bringing back all levels of contract? Since each level has its own column, it’s possible for contract types to “interact”. This doesn’t make a lot of sense, so we’ll ignore the month-to-month and two-year interaction. Same for tenure and two-year contracts, since longer contracts will naturally interact with tenure. That interaction is real, just not interesting. Ditto tenure and month-to-month contracts.
Tenure and Fiber Optic looks interesting, though. Let’s plot the SHAP values for those.
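The plotting call might look like this sketch, assuming the shap_values object prepared with SHAPforxgboost above; the exact dummy column name for fiber optic depends on what step_dummy generated, so treat "InternetService_Fiber.optic" as an assumption.

```r
# Sketch, assuming shap_values from shap.prep() above; the color_feature
# column name is an assumption about step_dummy's naming
shap.plot.dependence(
  data_long = shap_values,
  x = "tenure",
  color_feature = "InternetService_Fiber.optic"
)
```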
OK, this is interesting - customers with Fiber Optic (purple) and lower tenure are more likely to churn compared to other low-tenure customers. After about 18 months, it doesn’t seem to matter. We can now begin to hypothesize what’s driving this. For instance, we may suspect it’s related to price. We can return to our original dataset to map this out. SHAP plots always show the SHAP value on the y axis; at this point I want to see the raw data and have more flexibility in how I display it.
telco_munged %>%
  ggplot(aes(x = tenure, y = MonthlyCharges, color = Churn)) +
  geom_point(alpha = 0.4, size = 0.8) +
  facet_wrap(InternetService ~ .) +
  theme_ipsum() +
  theme(text = element_text(size = 16, family = "Oswald")) +
  scale_color_manual(values = c("yellow", "purple"))
There we go - Fiber Optic costs more, so we think that new customers don’t see the value and choose something else. We would want to vet this more, of course. If we have churn codes in our CRM, we could look at those for this cohort. And we’d want to explore possible ways to mitigate it, such as trying to increase value awareness, offering introductory discount pricing, etc.
So there we go. We wound up identifying a new cohort - new Fiber Optic customers - and a plan to find ways to reduce their churn. We could possibly have gotten there with aggregate data, but it would have taken a lot of exploration and trial and error. Note also that some domain knowledge is still needed. I’m not in telecom, but I still saw interactions that didn’t make sense to me and ignored those to focus on the ones that looked promising. And this isn’t the only takeaway we could get from this data, either. For instance, we may want to look at ways to move customers from month-to-month to annual contracts. This is what modeling the data gets us - a framework to ask better questions and to surface patterns in high-dimensional data that we may otherwise have missed.