MLB Team Replay Performance

statistics
r
mlb
probability
Author

Mark Jurries II

Published

October 28, 2025

MLB has allowed teams to challenge calls for several years now, and thanks to Baseball Savant, we can see information on every call that was reviewed. Scraping the data isn’t particularly difficult, so let’s look at 2025 and see which teams performed best. We’ll ignore replays initiated by umpires, as well as the All-Star Game and all postseason games. This leaves us with 1,154 total replays.

Show the code
library(hrbrthemes)
library(janitor)
library(gt)
library(gtExtras)
library(rvest)
library(tidyverse)

# Scrape the replay table from Baseball Savant
url <- "https://baseballsavant.mlb.com/replay"

page <- read_html(url)

tables <- html_table(page)

replay_data <- pluck(tables, 1) %>%
  clean_names() %>%
  filter(!is.na(inning)) %>%
  mutate(is_over_turned = ifelse(over_turned == 'Yes', 1, 0))

# Drop umpire-initiated reviews, the All-Star Game (2025-07-15),
# and the postseason
team_replay_data <- replay_data %>%
  filter(challenging_team != 'Umpire') %>%
  mutate(game_date = as.Date(game_date)) %>%
  filter(game_date != as.Date('2025-07-15')) %>%
  filter(game_date < as.Date('2025-09-30'))

team_replay_data %>%
  group_by(challenging_team) %>%
  summarise(n = n(),
            overturned_rate = mean(is_over_turned)) %>%
  arrange(desc(overturned_rate)) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(columns = c(overturned_rate), decimals = 1)
challenging_team n overturned_rate
HOU 32 71.9%
PHI 34 70.6%
MIL 26 69.2%
AZ 48 66.7%
KC 30 66.7%
COL 33 63.6%
SEA 30 63.3%
NYM 35 62.9%
LAD 43 60.5%
SD 30 60.0%
CIN 37 59.5%
CWS 32 59.4%
NYY 39 59.0%
STL 43 58.1%
ATL 31 58.1%
WSH 38 57.9%
MIN 40 57.5%
SF 46 56.5%
BOS 25 56.0%
CHC 51 54.9%
PIT 42 54.8%
CLE 37 54.1%
TOR 50 54.0%
LAA 32 50.0%
MIA 40 47.5%
ATH 55 45.5%
DET 36 44.4%
BAL 45 42.2%
TB 43 41.9%
TEX 51 41.2%

The Astros, Phillies, and Brewers all did well with their replays, each at or near 70%. The Rangers, Rays, and Orioles were all pretty bad, each under 43% (with my Tigers only at 44%). Only six teams had a losing record on their replays.
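Before we hand out any trophies, it’s worth remembering how small these samples are. As a rough sketch (the 55% baseline is an assumption, eyeballed as an approximate league-wide overturn rate from the table above), we can treat each challenge as a coin flip and ask whether even Houston’s league-best rate is clearly better than average:

```r
# Rough check: with only 32 challenges, is Houston's 71.9% clearly better
# than an average challenger? The 55% baseline is an assumption, eyeballed
# as a rough league-wide overturn rate from the table above.
hou_overturned <- 23  # 71.9% of 32 challenges
hou_challenges <- 32

res <- binom.test(hou_overturned, hou_challenges, p = 0.55)
res$p.value
```

Even the best team’s edge is hard to distinguish from luck at these sample sizes, which is worth keeping in mind for everything that follows.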

There are different types of calls, so let’s take a look at those.

Show the code
team_replay_data %>%
  group_by(type) %>%
  summarise(n = n(),
            overturned_rate = mean(is_over_turned)) %>%
  arrange(desc(n)) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(columns = c(overturned_rate), decimals = 1)
type n overturned_rate
Tag play 531 52.4%
Close play at 1st 339 70.5%
Hit by pitch 90 43.3%
Force play 67 44.8%
Catcher interference 27 77.8%
Catch/drop in outfield 25 40.0%
Fair/foul in outfield 20 50.0%
Stadium boundary call 12 75.0%
Home-plate collision 11 0.0%
Trap play in outfield 9 44.4%
Tag-up play 5 20.0%
Touching a base 5 0.0%
Fan interference 3 66.7%
Slide interference 3 66.7%
Timing Play 3 0.0%
Other 2 50.0%
1 100.0%
Pitch result 1 0.0%

Tag play being basically break-even checks out, as does the high overturn rate on close plays at first. A bit surprised to see hit by pitch so high on the list, though 90 over a season isn’t a lot.
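To put a number on how little 90 challenges tells us, here’s an exact binomial confidence interval around the hit-by-pitch overturn rate (39 of 90 overturned):

```r
# 39 of 90 hit-by-pitch challenges were overturned (43.3%). An exact
# binomial confidence interval shows how wide the uncertainty is:
hbp <- binom.test(39, 90)
hbp$conf.int  # spans roughly the low 30s to mid 50s, in percent
```

A spread of twenty-plus points means the “true” hit-by-pitch rate could plausibly sit anywhere from clearly-below-average to above tag plays.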

Next, let’s break it down by inning.

Show the code
team_replay_data %>%
  group_by(inning) %>%
  summarise(n = n(),
            overturned_rate = mean(is_over_turned)) %>%
  arrange(inning) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(columns = c(overturned_rate), decimals = 1)
inning n overturned_rate
1 107 77.6%
2 94 76.6%
3 100 66.0%
4 98 72.4%
5 121 57.0%
6 144 58.3%
7 137 51.1%
8 168 43.5%
9 140 32.9%
10 32 25.0%
11 10 30.0%
12 3 66.7%

Here’s another intuitive finding. Teams get one challenge per game and lose it if the call isn’t overturned, so in the early innings, when the stakes are lower, they only challenge plays they feel good about in order to keep the challenge available. As the game goes on, calls are less likely to be overturned, most likely because teams become more willing to risk losing the challenge if a reversal would help them win. It’s also possible that the replay booth becomes more conservative about overturning calls late, since a reversal could decide the game.
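The decline is steady enough to summarize with a single number. As a sketch, here’s a logistic fit on the aggregated counts from the table above (overturned counts reconstructed by rounding n times the rate):

```r
# Quantify the by-inning decline with a simple logistic fit on the
# aggregated per-inning counts from the table above
inning_counts <- data.frame(
  inning     = 1:12,
  n          = c(107, 94, 100, 98, 121, 144, 137, 168, 140, 32, 10, 3),
  overturned = c(83, 72, 66, 71, 69, 84, 70, 73, 46, 8, 3, 2)
)

fit <- glm(cbind(overturned, n - overturned) ~ inning,
           family = binomial, data = inning_counts)
coef(fit)["inning"]  # negative: the odds of an overturn shrink each inning
```

The negative slope confirms what the raw table shows: every additional inning lowers the odds that a challenged call gets overturned.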

Having the general lay of the land, we can now build a simple model to see which teams got the most out of the replays they had. The model will consider the replay type and the inning. We’ll then use it to generate each team’s expected overturn rate and compare that to their actual rate.

Show the code
# Logistic regression: overturn probability as a function of replay type,
# inning, and their interaction
replay_pred <- glm(is_over_turned ~ type * inning,
                   data = team_replay_data,
                   family = binomial(link = "logit"))

# Attach each challenge's predicted overturn probability
team_replay_data_with_pred <- team_replay_data %>%
  mutate(pred = predict(replay_pred, team_replay_data, type = "response"))

team_replay_data_with_pred %>%
  group_by(challenging_team) %>%
  summarise(n = n(),
            overturned_rate = mean(is_over_turned),
            pred = mean(pred)) %>%
  mutate(ot_oe = overturned_rate - pred,
         calls_over_expected = ot_oe * n) %>%
  arrange(desc(calls_over_expected)) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(columns = c('overturned_rate', 'pred', 'ot_oe'), decimals = 1) %>%
  fmt_number(columns = c(calls_over_expected), decimals = 2)
challenging_team n overturned_rate pred ot_oe calls_over_expected
PHI 34 70.6% 52.2% 18.4% 6.26
AZ 48 66.7% 54.0% 12.7% 6.09
HOU 32 71.9% 57.8% 14.0% 4.49
KC 30 66.7% 55.7% 11.0% 3.30
CHC 51 54.9% 48.9% 6.0% 3.08
NYM 35 62.9% 54.2% 8.6% 3.02
SEA 30 63.3% 55.3% 8.1% 2.42
WSH 38 57.9% 52.3% 5.6% 2.12
MIL 26 69.2% 61.9% 7.3% 1.91
COL 33 63.6% 59.9% 3.7% 1.23
SD 30 60.0% 57.4% 2.6% 0.77
SF 46 56.5% 55.0% 1.5% 0.69
NYY 39 59.0% 57.4% 1.6% 0.63
ATL 31 58.1% 57.1% 1.0% 0.31
CIN 37 59.5% 58.7% 0.8% 0.30
BOS 25 56.0% 56.4% −0.4% −0.11
LAD 43 60.5% 61.3% −0.9% −0.37
MIN 40 57.5% 59.2% −1.7% −0.67
CWS 32 59.4% 62.4% −3.0% −0.97
STL 43 58.1% 61.3% −3.1% −1.35
TOR 50 54.0% 57.4% −3.4% −1.69
PIT 42 54.8% 59.2% −4.5% −1.87
LAA 32 50.0% 56.1% −6.1% −1.96
CLE 37 54.1% 61.1% −7.1% −2.62
ATH 55 45.5% 50.8% −5.3% −2.93
TB 43 41.9% 50.7% −8.9% −3.82
DET 36 44.4% 55.8% −11.3% −4.08
MIA 40 47.5% 58.6% −11.1% −4.45
BAL 45 42.2% 52.2% −10.0% −4.48
TEX 51 41.2% 51.5% −10.3% −5.26

Our model - which we should caveat is very simple - predicted the Diamondbacks to overturn only 54% of their challenges. They instead got 67%, a gap of about 13 points, or roughly 6 calls over expected, which depending on the game state could be very impactful. (Arizona finished under .500, so it didn’t help them that much in the end.) The Phillies edged them for the top spot: an even larger rate gap over fewer challenges left Philadelphia with a slightly higher calls-over-expected tally. That said, given the sample sizes, it’s hard to read too much into this. Let’s also note that the Dodgers are middle of the pack, both in raw and predicted terms, yet are on their way to another World Series win.

Having additional information, such as the score, which umpire made the call, or whether the team was challenging on offense or defense, would add a lot of value to the model.
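For concreteness, the calls-over-expected arithmetic for Arizona works out like this, using the rounded rates from the table above:

```r
# Calls over expected = (actual rate - predicted rate) * number of challenges
az_actual <- 0.667  # Arizona's actual overturn rate
az_pred   <- 0.540  # model's predicted rate
az_n      <- 48     # challenges

calls_over_expected <- (az_actual - az_pred) * az_n
calls_over_expected  # ~6.1 extra overturned calls
```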

It’s always helpful to investigate the data a little bit more, so let’s look at the Tigers.

Show the code
team_replay_data_with_pred %>%
  filter(challenging_team == 'DET') %>%
  group_by(type) %>%
  summarise(n = n(),
            overturned_rate = mean(is_over_turned),
            pred = mean(pred)) %>%
  mutate(ot_oe = overturned_rate - pred,
         calls_over_expected = ot_oe * n) %>%
  arrange(desc(ot_oe)) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(columns = c('overturned_rate', 'pred', 'ot_oe'), decimals = 1) %>%
  fmt_number(columns = c(calls_over_expected), decimals = 2)
type n overturned_rate pred ot_oe calls_over_expected
Close play at 1st 9 77.8% 67.6% 10.2% 0.92
Home-plate collision 1 0.0% 0.0% −0.0% 0.00
Force play 2 50.0% 63.6% −13.6% −0.27
Catcher interference 3 66.7% 80.5% −13.8% −0.42
Fair/foul in outfield 3 33.3% 49.9% −16.5% −0.50
Catch/drop in outfield 4 25.0% 42.6% −17.6% −0.70
Tag play 14 28.6% 50.8% −22.2% −3.11

The Tigers did well on close plays at first, but were pretty bad at tag plays. The close play at first success is the only thing keeping them from being completely abysmal.
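Even here, sample size deserves a hedge. A quick exact binomial test on Detroit’s tag plays (4 of 14 overturned, against an expected 50.8%) shows how little 14 attempts can prove:

```r
# Is Detroit's tag-play shortfall (4 of 14 vs. an expected 50.8%)
# distinguishable from bad luck? Exact binomial test:
det_tag <- binom.test(4, 14, p = 0.508)
det_tag$p.value
```

With only 14 tag plays, even a 22-point shortfall isn’t statistically distinguishable from noise, which is one more reason not to treat these team-level splits as skill.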

Wrapping up, there are a few things we can note about this. First, having a more robust dataset would help us better identify teams that do well with replay. We can look at Statcast to watch video of every challenge, which would be worthwhile if we really want to dig deep.

Second, this dataset only includes calls that were challenged, not calls that should have been challenged. For instance, a runner may have been out at second but was called safe, and the defensive team didn’t challenge. That should count as a debit on their replay ability, but since we don’t have a way to track these situations, we lose sight of them.

Finally, the biggest thing to add would be win probability change. Getting a successful challenge during a game that’s 2-2 in the 8th is very different from getting the same challenge when you’re up 10-0 in the 8th. Might make for a fun phase 2 of this at some point.
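If we did have a win-probability column, the weighting could look something like this. The wpa values below are made-up toy numbers, since Savant’s replay table doesn’t provide win probability:

```r
# Sketch of the phase-2 idea: credit each successful challenge with the
# win-probability swing it produced. The wpa column is hypothetical toy
# data; the Savant replay table does not include win probability.
toy_replays <- data.frame(
  challenging_team = c("DET", "DET", "TEX"),
  is_over_turned   = c(1, 0, 1),
  wpa              = c(0.21, 0.00, 0.03)
)

# Total win probability gained via overturned calls, per team (base R)
wp_gained <- tapply(toy_replays$is_over_turned * toy_replays$wpa,
                    toy_replays$challenging_team, sum)
wp_gained
```

Under this weighting, a team that wins one huge late-game challenge could outrank a team that wins several trivial ones, which is closer to what we actually care about.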