A common use case in most businesses is preventing churn. A common way of going about this is to compare the averages for those who churned vs. those who didn’t, find what stands out, and start to target that. This may be fine provided (A) the metrics/attributes you find have a clear distinction and compelling narrative to explain why, and (B) your overall set of things to look at is very small. A can happen, but B is rare.
Well, that’s a lot. We can see some patterns here, but since there can be any number of combinations of these values, we won’t know which ones really matter. Having dependents seems to prevent churn, but so do two-year plans. What if those are correlated, and most people who have two-year plans also have dependents? What then?
This is where we move from the aggregate to the individual. Aggregate data is fine, but it can also hide stories or, worse, lead us down the wrong path. Instead, we can turn to modeling our data.
We’ll start with a simple generalized linear model. This is a fancy way of saying “we’ll use a model to predict a binary outcome”. We’ll run this on everything in the dataset except customer ID, then look at our predictors. It gives us odds ratios - if a predictor’s value is 1.20, it increases the odds of churning by 20%, while a value of 0.80 means a 20% decrease. The CI (confidence interval) gives the expected range for the value given all we know; if it contains 1 (no difference), the predictor is likely not significant. A p-value below 0.05 means it likely is.
*Since each customer has their own outcome, we don’t want to include their ID in the model, or we’ll drastically overfit. If we know Bob churned, then once the model knows it’s Bob, it knows he churned, and our other variables won’t tell us anything.
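As a quick illustration of how to read odds ratios (using R’s built-in mtcars data rather than the telco set, so it runs anywhere), exponentiating a logistic model’s coefficients puts them on the odds-ratio scale:

```r
# Toy example, NOT the telco data: predict transmission type (am) from weight (wt)
m <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

exp(coef(m))    # odds ratios: > 1 raises the odds, < 1 lowers them
exp(confint(m)) # 95% CI on the odds-ratio scale; if it spans 1, likely not significant
```

Here the odds ratio for `wt` comes out below 1: heavier cars have lower odds of a manual transmission, the same way an odds ratio below 1 in our churn model means lower odds of churning.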
telco_lm <- glm(
  Churn ~ .,
  data = telco_munged,
  family = binomial(link = "logit")
)

tab_model(telco_lm)
Churn

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 3.21 | 0.65 – 15.86 | 0.153 |
| gender [Male] | 0.98 | 0.86 – 1.11 | 0.736 |
| SeniorCitizen | 1.24 | 1.05 – 1.47 | 0.010 |
| Partner [Yes] | 1.00 | 0.86 – 1.16 | 0.996 |
| Dependents [Yes] | 0.86 | 0.72 – 1.03 | 0.098 |
| tenure | 0.94 | 0.93 – 0.95 | <0.001 |
| PhoneService [Yes] | 1.19 | 0.33 – 4.24 | 0.792 |
| MultipleLines [Yes] | 1.57 | 1.11 – 2.22 | 0.011 |
| InternetService [Fiber optic] | 5.74 | 1.20 – 27.49 | 0.029 |
| InternetService [No] | 0.17 | 0.03 – 0.82 | 0.027 |
| OnlineSecurity [Yes] | 0.81 | 0.57 – 1.16 | 0.250 |
| OnlineBackup [Yes] | 1.03 | 0.73 – 1.45 | 0.882 |
| DeviceProtection [Yes] | 1.16 | 0.82 – 1.64 | 0.403 |
| TechSupport [Yes] | 0.83 | 0.59 – 1.19 | 0.318 |
| StreamingTV [Yes] | 1.80 | 0.95 – 3.42 | 0.070 |
| StreamingMovies [Yes] | 1.82 | 0.96 – 3.46 | 0.067 |
| Contract [One year] | 0.52 | 0.42 – 0.64 | <0.001 |
| Contract [Two year] | 0.26 | 0.18 – 0.36 | <0.001 |
| PaperlessBilling [Yes] | 1.41 | 1.22 – 1.63 | <0.001 |
| PaymentMethod [Credit card (automatic)] | 0.92 | 0.73 – 1.15 | 0.442 |
| PaymentMethod [Electronic check] | 1.36 | 1.13 – 1.63 | 0.001 |
| PaymentMethod [Mailed check] | 0.94 | 0.75 – 1.18 | 0.616 |
| MonthlyCharges | 0.96 | 0.90 – 1.02 | 0.204 |
| TotalCharges | 1.00 | 1.00 – 1.00 | <0.001 |
| Observations | 7032 | | |
| R² Tjur | 0.308 | | |
Already this is better. We see that Senior Citizens, Multiple Lines, Fiber Optic, Paperless Billing, and E-Check are all associated with higher churn (odds ratio > 1), while higher tenure, No Internet Service, One Year Contracts, Two Year Contracts, and Total Charges are associated with lower churn. Note that we’re saying associated - we’re not claiming a causal relationship, only that these tend to be present when customers churn.
This is generally going to take us far enough, but TotalCharges is goofy. It ranges from $18.80 to $8,648.80, which is a pretty wide range, especially alongside predictors that are mostly 0s and 1s. We’re also missing interactions - perhaps older customers are less price sensitive, for instance.
We can take it a step further by using XGBoost. Since it’s an ensemble tree model, it’ll capture some interactions for us. We’ll then extract the SHAP values, which tell us, per customer, what mattered most.
# we want "Yes" to be our main target, so we need to set our levels
# appropriately for XGBoost
telco_munged_td_pre <- telco_munged %>%
  mutate(Churn = fct_relevel(Churn, "Yes", "No"))

# we need dummy variables created now so we can model interactions later.
# If we were skipping that part, we'd just create a recipe and call it good.
telco_munged_td <- recipe(Churn ~ ., data = telco_munged_td_pre) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  prep() %>%
  bake(telco_munged_td_pre)

xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = 4,
  min_n = 10,
  loss_reduction = 0.01,
  sample_size = 0.8,
  mtry = 3,
  learn_rate = 0.01
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_recipe <- recipe(Churn ~ ., data = telco_munged_td)

xgb_wflow <- workflow() %>%
  add_model(xgb_spec) %>%
  add_recipe(xgb_recipe)

xgb_fit <- fit(xgb_wflow, data = telco_munged_td)

processed_data <- bake(prep(xgb_recipe), new_data = telco_munged_td, has_role("predictor")) %>%
  as.matrix()

xgb_engine <- extract_fit_engine(xgb_fit)

shap_values <- shap.prep(xgb_model = xgb_engine, X_train = processed_data)
shap.plot.summary(shap_values)
This shows us that high-tenure customers are less likely to churn, and contract type makes a difference: month-to-month customers are more likely to churn, while two-year customers are far less likely. Note this shows us all categorical values - in our linear model, we only saw one- and two-year contracts, not month-to-month. There are ways to see that in a linear model (sjPlot is great for this sort of thing if we plot our model), but tree models like XGBoost are generally better for seeing all levels.
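As a sketch of that sjPlot approach, assuming the telco_lm model fit earlier: plotting predicted probabilities shows every level of a term, including the reference level the coefficient table hides.

```r
# Sketch, assuming telco_lm from the GLM fit above
library(sjPlot)

# Predicted churn probability for each contract level, month-to-month included
plot_model(telco_lm, type = "pred", terms = "Contract")
```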
Finally, we can look at our interactions. We’ll use the hstats package to get variable importance. This will also bring back our global (non-interacting) values, which we don’t really need at this point, but it doesn’t hurt to look again, either.
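A minimal sketch of that hstats call, assuming the xgb_engine and processed_data objects from the XGBoost block above (xgboost’s predict method accepts a plain matrix, which is why we prepared one):

```r
# Sketch, assuming xgb_engine and processed_data from the XGBoost fit above
library(hstats)

# Friedman's H-statistics: how much of each pair's joint effect is interaction
interaction_stats <- hstats(xgb_engine, X = processed_data)

h2_pairwise(interaction_stats)  # pairwise interaction strengths
plot(interaction_stats)         # overall importance plus interactions
```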
Remember how we were bringing back all levels of contract? Since each level has its own column, it’s possible for contract types to “interact”. This doesn’t make a lot of sense, so we’ll ignore the month-to-month and two-year interaction. Same for tenure and two-year contracts, since longer contracts will naturally interact with tenure. That interaction is real, just not interesting. Ditto tenure and month-to-month contracts.
Tenure and Fiber Optic looks interesting, though. Let’s plot the SHAP values for those.
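The plotting call might look like this sketch, assuming the shap_values object prepared with SHAPforxgboost above; the exact dummy column name for fiber optic depends on what step_dummy generated, so treat "InternetService_Fiber.optic" as an assumption.

```r
# Sketch, assuming shap_values from shap.prep() above; the color_feature
# column name is an assumption about step_dummy's naming
shap.plot.dependence(
  data_long = shap_values,
  x = "tenure",
  color_feature = "InternetService_Fiber.optic"
)
```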
OK, this is interesting - customers with Fiber Optic (purple) and lower tenure are more likely to churn compared to other low-tenure customers. After about 18 months, it doesn’t seem to matter. We can now begin to hypothesize what’s driving this. For instance, we may suspect it’s related to price. We can return to our original dataset to map this out. SHAP plots always show the SHAP value on the y axis; at this point I want to see the raw data and have more flexibility in how I display it.
telco_munged %>%
  ggplot(aes(x = tenure, y = MonthlyCharges, color = Churn)) +
  geom_point(alpha = 0.4, size = 0.8) +
  facet_wrap(InternetService ~ .) +
  theme_ipsum() +
  theme(text = element_text(size = 16, family = "Oswald")) +
  scale_color_manual(values = c("yellow", "purple"))
There we go - Fiber Optic costs more, so we think that new customers don’t see the value and choose something else. We would want to vet this more, of course. If we have churn codes in our CRM, we could look at those for this cohort. And we’d want to explore possible ways to mitigate it, such as trying to increase value awareness, offering introductory discount pricing, etc.
So there we go. We wound up identifying a new cohort - new Fiber Optic customers - and a plan to find ways to reduce their churn. We could possibly have gotten there with aggregate data, but it would have taken a lot of exploration and trial and error. Note also that some domain knowledge is still needed. I’m not in telecom, but I still saw interactions that didn’t make sense to me and ignored those to focus on the ones that looked promising. And this isn’t the only takeaway we could get from this data, either. For instance, we may want to look at ways to move customers from month-to-month to annual contracts. This is what modeling the data gets us - a framework to ask better questions and to surface patterns in high-dimensional data that we may otherwise have missed.