SHAP Values and XGBoost

statistics
r
machine learning
Author

Mark Jurries II

Published

September 15, 2025

Finding out what variables drive an outcome is one of the most challenging and rewarding parts of being a data analyst. Given a large dataset, you could spend hours combing through the data trying to find something. Or you could create a model and let it surface the findings for you.

There are many excellent ways to do this. I’m a fan of creating either a generalized linear model (GLM) or a linear model (LM), then using sjPlot to explore the outcome. This works really well in many instances, but oftentimes your columns with contain interactions that you may be unaware of. You could find a package* that will iterate through all the columns and try different combinations of interactions, but that takes forever and generally provides weird results, at least in my experience.

*Yes, such a package exists. I forget its name, it’s not bad but I wound up not doing much with it.

Thankfully, XGBoost does a pretty nice job of working that out for us, and we can use shapviz to return Shapley values to see how variables influence the outcome . To prove this out, we’ll use this US Logistics Dataset from Kaggle. We’ll build a model to see how likely it is that a package is delayed given its attributes. Since we don’t know it particularly well, let’s use skimr to take a look first.

library(tidyverse)
library(tidymodels)
library(hrbrthemes)
library(shapviz)
library(patchwork)
library(skimr)

logistics <- read_csv('logistics_shipments_dataset.csv')

skimr::skim(logistics)
Data summary
Name logistics
Number of rows 2000
Number of columns 11
_______________________
Column type frequency:
character 5
Date 2
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Shipment_ID 0 1 7 7 0 2000 0
Origin_Warehouse 0 1 12 13 0 10 0
Destination 0 1 5 13 0 15 0
Carrier 0 1 3 16 0 7 0
Status 0 1 4 10 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
Shipment_Date 0 1.00 2023-01-01 2023-12-31 2023-07-07 364
Delivery_Date 32 0.98 2023-01-03 2024-01-12 2023-07-12 368

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Weight_kg 0 1.00 30.18 124.97 0.00 12.30 20.70 33.92 5404.20 ▇▁▁▁▁
Cost 41 0.98 205.16 222.59 17.89 117.71 196.42 272.12 6562.21 ▇▁▁▁▁
Distance_miles 0 1.00 1275.87 691.38 101.00 690.25 1262.50 1867.25 2499.00 ▇▇▇▇▇
Transit_Days 0 1.00 4.18 1.84 1.00 3.00 4.00 5.00 12.00 ▇▇▃▁▁

We’ve got a few missing values we’ll need to get rid of. There are also some values in Cost that appear to be outliers. In a real-world scenario we’d investigate these more to see if they’re legit, since this is a toy example we’ll just filter them out. We also don’t want shipment ID in the model since that will overfit, and the date cols aren’t particularly useful, either. We’ve transformed Status to a factor for our outcome, so we can drop that as well.

logistics_prepped <- logistics %>%
  filter(Status %in% c('Delayed', 'Delivered')) %>%
  mutate(is_delayed = ifelse(Status == 'Delayed', 1, 0),
         is_delayed = as.factor(is_delayed)) %>%
  select(-Shipment_ID, -Shipment_Date, -Delivery_Date, -Status) %>%
  drop_na() %>%
  filter(Cost < 1000)

For building our model, we’ll use Tidymodels, borrowing a cup of code from Julia Silge’s post on XGBoost to help us out. Since we’re more interested in explanation than predictions, tuning is perhaps overkill for this instance, but we may as well do it right.

set.seed(123)

logistics_recipe <- logistics_prepped %>%
  recipe(is_delayed ~ .) %>%
  step_dummy(Carrier) %>%
  step_dummy(Origin_Warehouse) %>%
  step_dummy(Destination) 

logistics_prepped_small <- logistics_prepped[sample(nrow(logistics_prepped), 1000), ]

logistics_prepped_small_prep <- bake(
  prep(logistics_recipe), 
  has_role("predictor"),
  new_data = logistics_prepped_small, 
  composition = "matrix"
)

xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(), min_n = tune(),
  loss_reduction = tune(),                     ## first three: model complexity
  sample_size = tune(), mtry = tune(),         ## randomness
  learn_rate = tune()                          ## step size
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), logistics_prepped),
  learn_rate(),
  size = 30
)

xgb_wf <- logistics_recipe %>%
  workflow() %>%
  add_model(xgb_spec)

folds <- vfold_cv(logistics_prepped, strata = is_delayed)

doParallel::registerDoParallel()

xgb_res <- tune_grid(
  xgb_wf,
  resamples = folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)

best_auc <- select_best(xgb_res, metric = "roc_auc")

final_xgb <- finalize_workflow(
  xgb_wf,
  best_auc
)

fit <- final_xgb %>%
  fit(logistics_prepped)

With our model built, we can now extract Shapley values. Let’s do that and see what the model says is most important.

shap <- shapviz(extract_fit_engine(fit), 
                X_pred = logistics_prepped_small_prep, 
                X = logistics_prepped_small_prep)


sv_importance(shap, kind = "beeswarm", show_numbers = TRUE)+
  theme_ipsum()

This shows that if you want to avoid a delay, you’d best be shipping off to Boston. Short of that DHL is a good choice. Shipping out of NYC is a bad idea, though.

The beeswarm does a good job, but it’s also nice to look at variables on their own. Let’s build a function to allow us to do that (we want to customize beyond shapviz’s default), excluding carrier and origin warehouse just to keep this readable. It’d be useful if digging further to create themed collections to see all like info together.

vars <- names(logistics_prepped_small_prep %>% as_tibble())

vars <- vars[!grepl('Carrier_', vars)]
vars <- vars[!grepl('Origin_Warehouse_', vars)]

create_shap_plot <- function(variable){
  sv_dependence(shap, variable, color_var = variable)+
    theme_ipsum()+
    theme(legend.position = 'none')+
    ylab('')
}

plot_list <- map(vars, create_shap_plot)
wrap_plots(plot_list, ncol = 5)

There are some very interesting questions this raises - cost clearly has something going on, for instance, as does distance. Transit days looks fairly sensible, and we may have missed an outlier in weight. All this could lead to some fun next steps, such as log-transforming cost and weight, running just for Boston to see what’s up there, or simple EDAs of our data to explore the outcomes. Regardless, an exercise like this is a good, relatively quick way to see how your data associates to outcomes.

*I saw we since this is quasi-academic. We both know it’s my fault, though.