SHAP Values and XGBoost – Point Estimates

Finding out what variables drive an outcome is one of the most challenging and rewarding parts of being a data analyst. Given a large dataset, you could spend hours combing through the data trying to find something. Or you could create a model and let it surface the findings for you.

There are many excellent ways to do this. I’m a fan of creating either a generalized linear model (GLM) or a linear model (LM), then using sjPlot to explore the outcome. This works really well in many instances, but oftentimes your columns with contain interactions that you may be unaware of. You could find a package* that will iterate through all the columns and try different combinations of interactions, but that takes forever and generally provides weird results, at least in my experience.

*Yes, such a package exists. I forget its name, it’s not bad but I wound up not doing much with it.

Thankfully, XGBoost does a pretty nice job of working that out for us, and we can use shapviz to return Shapley values to see how variables influence the outcome . To prove this out, we’ll use this US Logistics Dataset from Kaggle. We’ll build a model to see how likely it is that a package is delayed given its attributes. Since we don’t know it particularly well, let’s use skimr to take a look first.

library(tidyverse)
library(tidymodels)
library(hrbrthemes)
library(shapviz)
library(patchwork)
library(skimr)

logistics <- read_csv('logistics_shipments_dataset.csv')

skimr::skim(logistics)

Data summary
Name	logistics
Number of rows	2000
Number of columns	11
_______________________
Column type frequency:
character	5
Date	2
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Shipment_ID	1	7	7	2000
Origin_Warehouse	1	12	13	10
Destination	1	5	13	15
Carrier	1	3	16	7
Status	1	4	10	5

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
Shipment_Date	0	1.00	2023-01-01	2023-12-31	2023-07-07	364
Delivery_Date	32	0.98	2023-01-03	2024-01-12	2023-07-12	368

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Weight_kg	0	1.00	30.18	124.97	0.00	12.30	20.70	33.92	5404.20	▇▁▁▁▁
Cost	41	0.98	205.16	222.59	17.89	117.71	196.42	272.12	6562.21	▇▁▁▁▁
Distance_miles	0	1.00	1275.87	691.38	101.00	690.25	1262.50	1867.25	2499.00	▇▇▇▇▇
Transit_Days	0	1.00	4.18	1.84	1.00	3.00	4.00	5.00	12.00	▇▇▃▁▁

We’ve got a few missing values we’ll need to get rid of. There are also some values in Cost that appear to be outliers. In a real-world scenario we’d investigate these more to see if they’re legit, since this is a toy example we’ll just filter them out. We also don’t want shipment ID in the model since that will overfit, and the date cols aren’t particularly useful, either. We’ve transformed Status to a factor for our outcome, so we can drop that as well.

logistics_prepped <- logistics %>%
  filter(Status %in% c('Delayed', 'Delivered')) %>%
  mutate(is_delayed = ifelse(Status == 'Delayed', 1, 0),
         is_delayed = as.factor(is_delayed)) %>%
  select(-Shipment_ID, -Shipment_Date, -Delivery_Date, -Status) %>%
  drop_na() %>%
  filter(Cost < 1000)

For building our model, we’ll use Tidymodels, borrowing a cup of code from Julia Silge’s post on XGBoost to help us out. Since we’re more interested in explanation than predictions, tuning is perhaps overkill for this instance, but we may as well do it right.

set.seed(123)

logistics_recipe <- logistics_prepped %>%
  recipe(is_delayed ~ .) %>%
  step_dummy(Carrier) %>%
  step_dummy(Origin_Warehouse) %>%
  step_dummy(Destination) 

logistics_prepped_small <- logistics_prepped[sample(nrow(logistics_prepped), 1000), ]

logistics_prepped_small_prep <- bake(
  prep(logistics_recipe), 
  has_role("predictor"),
  new_data = logistics_prepped_small, 
  composition = "matrix"
)

xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(), min_n = tune(),
  loss_reduction = tune(),                     ## first three: model complexity
  sample_size = tune(), mtry = tune(),         ## randomness
  learn_rate = tune()                          ## step size
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), logistics_prepped),
  learn_rate(),
  size = 30
)

xgb_wf <- logistics_recipe %>%
  workflow() %>%
  add_model(xgb_spec)

folds <- vfold_cv(logistics_prepped, strata = is_delayed)

doParallel::registerDoParallel()

xgb_res <- tune_grid(
  xgb_wf,
  resamples = folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)

best_auc <- select_best(xgb_res, metric = "roc_auc")

final_xgb <- finalize_workflow(
  xgb_wf,
  best_auc
)

fit <- final_xgb %>%
  fit(logistics_prepped)

With our model built, we can now extract Shapley values. Let’s do that and see what the model says is most important.

shap <- shapviz(extract_fit_engine(fit), 
                X_pred = logistics_prepped_small_prep, 
                X = logistics_prepped_small_prep)


sv_importance(shap, kind = "beeswarm", show_numbers = TRUE)+
  theme_ipsum()

This shows that if you want to avoid a delay, you’d best be shipping off to Boston. Short of that DHL is a good choice. Shipping out of NYC is a bad idea, though.

The beeswarm does a good job, but it’s also nice to look at variables on their own. Let’s build a function to allow us to do that (we want to customize beyond shapviz’s default), excluding carrier and origin warehouse just to keep this readable. It’d be useful if digging further to create themed collections to see all like info together.

vars <- names(logistics_prepped_small_prep %>% as_tibble())

vars <- vars[!grepl('Carrier_', vars)]
vars <- vars[!grepl('Origin_Warehouse_', vars)]

create_shap_plot <- function(variable){
  sv_dependence(shap, variable, color_var = variable)+
    theme_ipsum()+
    theme(legend.position = 'none')+
    ylab('')
}

plot_list <- map(vars, create_shap_plot)
wrap_plots(plot_list, ncol = 5)

There are some very interesting questions this raises - cost clearly has something going on, for instance, as does distance. Transit days looks fairly sensible, and we may have missed an outlier in weight. All this could lead to some fun next steps, such as log-transforming cost and weight, running just for Boston to see what’s up there, or simple EDAs of our data to explore the outcomes. Regardless, an exercise like this is a good, relatively quick way to see how your data associates to outcomes.

*I saw we since this is quasi-academic. We both know it’s my fault, though.