MLB Team Replay Performance

statistics
r
mlb
probability
Author

Mark Jurries II

Published

October 28, 2025

MLB has allowed teams to challenge calls for several years now, and thanks to Baseball Savant, we can see information on every call that was reviewed. Scraping the data isn’t particularly difficult, so let’s look at 2025 and see which teams performed best. We’ll ignore replays initiated by umpires, as well as the All-Star Game and all postseason games. This leaves us with 1,154 total replays.

Show the code
library(hrbrthemes)
library(janitor)
library(gt)
library(gtExtras)
library(rvest)
library(tidyverse)

# Scrape the replay table from Baseball Savant
url <- "https://baseballsavant.mlb.com/replay"

page <- read_html(url)

tables <- html_table(page)

replay_data <- pluck(tables, 1) %>%
  clean_names() %>%
  filter(!is.na(inning)) %>%
  mutate(is_over_turned = ifelse(over_turned == 'Yes', 1, 0))

# Drop umpire-initiated reviews, the All-Star Game (2025-07-15),
# and the postseason
team_replay_data <- replay_data %>%
  filter(challenging_team != 'Umpire') %>%
  mutate(game_date = as.Date(game_date)) %>%
  filter(game_date != as.Date('2025-07-15')) %>%
  filter(game_date < as.Date('2025-09-30'))

team_replay_data %>%
  group_by(challenging_team) %>%
  summarise(n = n(),
            overturned_rate = mean(is_over_turned)) %>%
  arrange(desc(overturned_rate)) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(columns = c(overturned_rate), decimals = 1)
challenging_team n overturned_rate
HOU 32 71.9%
PHI 34 70.6%
MIL 26 69.2%
AZ 48 66.7%
KC 30 66.7%
COL 33 63.6%
SEA 30 63.3%
NYM 35 62.9%
LAD 43 60.5%
SD 30 60.0%
CIN 37 59.5%
CWS 32 59.4%
NYY 39 59.0%
STL 43 58.1%
ATL 31 58.1%
WSH 38 57.9%
MIN 40 57.5%
SF 46 56.5%
BOS 25 56.0%
CHC 51 54.9%
PIT 42 54.8%
CLE 37 54.1%
TOR 50 54.0%
LAA 32 50.0%
MIA 40 47.5%
ATH 55 45.5%
DET 36 44.4%
BAL 45 42.2%
TB 43 41.9%
TEX 51 41.2%

The Astros, Phillies, and Brewers all did well with their replays, each at or near 70%. The Rangers, Rays, and Orioles were all pretty bad, each under 43% (with my Tigers only at 44%). Only six teams had a losing record on their replays.
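Before we hand out any trophies, it’s worth remembering how small these samples are. As a rough sketch (the 55% baseline is an assumption, eyeballed as an approximate league-wide overturn rate from the table above), we can treat each challenge as a coin flip and ask whether even Houston’s league-best rate is clearly better than average:

```r
# Rough check: with only 32 challenges, is Houston's 71.9% clearly better
# than an average challenger? The 55% baseline is an assumption, eyeballed
# as a rough league-wide overturn rate from the table above.
hou_overturned <- 23  # 71.9% of 32 challenges
hou_challenges <- 32

res <- binom.test(hou_overturned, hou_challenges, p = 0.55)
res$p.value
```

Even the best team’s edge is hard to distinguish from luck at these sample sizes, which is worth keeping in mind for everything that follows.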

There are different types of calls, so let’s take a look at those.

Show the code
team_replay_data %>%
  group_by(type) %>%
  summarise(n = n(),
            overturned_rate = mean(is_over_turned)) %>%
  arrange(desc(n)) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(columns = c(overturned_rate), decimals = 1)
type n overturned_rate
Tag play 531 52.4%
Close play at 1st 339 70.5%
Hit by pitch 90 43.3%
Force play 67 44.8%
Catcher interference 27 77.8%
Catch/drop in outfield 25 40.0%
Fair/foul in outfield 20 50.0%
Stadium boundary call 12 75.0%
Home-plate collision 11 0.0%
Trap play in outfield 9 44.4%
Tag-up play 5 20.0%
Touching a base 5 0.0%
Fan interference 3 66.7%
Slide interference 3 66.7%
Timing Play 3 0.0%
Other 2 50.0%
1 100.0%
Pitch result 1 0.0%

Tag play being basically break-even checks out, as does the high overturn rate on close plays at first. A bit surprised to see hit by pitch so high on the list, though 90 over a season isn’t a lot.
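To put a number on how little 90 challenges tells us, here’s an exact binomial confidence interval around the hit-by-pitch overturn rate (39 of 90 overturned):

```r
# 39 of 90 hit-by-pitch challenges were overturned (43.3%). An exact
# binomial confidence interval shows how wide the uncertainty is:
hbp <- binom.test(39, 90)
hbp$conf.int  # spans roughly the low 30s to mid 50s, in percent
```

A spread of twenty-plus points means the “true” hit-by-pitch rate could plausibly sit anywhere from clearly-below-average to above tag plays.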

Next, let’s break it down by inning.

Show the code
team_replay_data %>%
  group_by(inning) %>%
  summarise(n = n(),
            overturned_rate = mean(is_over_turned)) %>%
  arrange(inning) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(columns = c(overturned_rate), decimals = 1)
inning n overturned_rate
1 107 77.6%
2 94 76.6%
3 100 66.0%
4 98 72.4%
5 121 57.0%
6 144 58.3%
7 137 51.1%
8 168 43.5%
9 140 32.9%
10 32 25.0%
11 10 30.0%
12 3 66.7%

Here’s another intuitive finding. Teams get one challenge per game and lose it if the call isn’t overturned, so in the early innings, when the stakes are lower, they only challenge plays they feel good about in order to keep the challenge available. As the game goes on, calls are less likely to be overturned, most likely because teams become more willing to risk losing the challenge if a reversal would help them win. It’s also possible that the replay booth becomes more conservative about overturning calls late, since a reversal could decide the game.
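The decline is steady enough to summarize with a single number. As a sketch, here’s a logistic fit on the aggregated counts from the table above (overturned counts reconstructed by rounding n times the rate):

```r
# Quantify the by-inning decline with a simple logistic fit on the
# aggregated per-inning counts from the table above
inning_counts <- data.frame(
  inning     = 1:12,
  n          = c(107, 94, 100, 98, 121, 144, 137, 168, 140, 32, 10, 3),
  overturned = c(83, 72, 66, 71, 69, 84, 70, 73, 46, 8, 3, 2)
)

fit <- glm(cbind(overturned, n - overturned) ~ inning,
           family = binomial, data = inning_counts)
coef(fit)["inning"]  # negative: the odds of an overturn shrink each inning
```

The negative slope confirms what the raw table shows: every additional inning lowers the odds that a challenged call gets overturned.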

Having the general lay of the land, we can now build a simple model to see which teams got the most out of the replays they had. The model will consider the replay type and the inning. We’ll then use it to generate each team’s expected overturn rate and compare that to their actual rate.

Show the code
# Logistic regression: overturn probability as a function of replay type,
# inning, and their interaction
replay_pred <- glm(is_over_turned ~ type * inning,
                   data = team_replay_data,
                   family = binomial(link = "logit"))

# Attach each challenge's predicted overturn probability
team_replay_data_with_pred <- team_replay_data %>%
  mutate(pred = predict(replay_pred, team_replay_data, type = "response"))

team_replay_data_with_pred %>%
  group_by(challenging_team) %>%
  summarise(n = n(),
            overturned_rate = mean(is_over_turned),
            pred = mean(pred)) %>%
  mutate(ot_oe = overturned_rate - pred,
         calls_over_expected = ot_oe * n) %>%
  arrange(desc(calls_over_expected)) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(columns = c('overturned_rate', 'pred', 'ot_oe'), decimals = 1) %>%
  fmt_number(columns = c(calls_over_expected), decimals = 2)
challenging_team n overturned_rate pred ot_oe calls_over_expected
PHI 34 70.6% 52.2% 18.4% 6.26
AZ 48 66.7% 54.0% 12.7% 6.09
HOU 32 71.9% 57.8% 14.0% 4.49
KC 30 66.7% 55.7% 11.0% 3.30
CHC 51 54.9% 48.9% 6.0% 3.08
NYM 35 62.9% 54.2% 8.6% 3.02
SEA 30 63.3% 55.3% 8.1% 2.42
WSH 38 57.9% 52.3% 5.6% 2.12
MIL 26 69.2% 61.9% 7.3% 1.91
COL 33 63.6% 59.9% 3.7% 1.23
SD 30 60.0% 57.4% 2.6% 0.77
SF 46 56.5% 55.0% 1.5% 0.69
NYY 39 59.0% 57.4% 1.6% 0.63
ATL 31 58.1% 57.1% 1.0% 0.31
CIN 37 59.5% 58.7% 0.8% 0.30
BOS 25 56.0% 56.4% −0.4% −0.11
LAD 43 60.5% 61.3% −0.9% −0.37
MIN 40 57.5% 59.2% −1.7% −0.67
CWS 32 59.4% 62.4% −3.0% −0.97
STL 43 58.1% 61.3% −3.1% −1.35
TOR 50 54.0% 57.4% −3.4% −1.69
PIT 42 54.8% 59.2% −4.5% −1.87
LAA 32 50.0% 56.1% −6.1% −1.96
CLE 37 54.1% 61.1% −7.1% −2.62
ATH 55 45.5% 50.8% −5.3% −2.93
TB 43 41.9% 50.7% −8.9% −3.82
DET 36 44.4% 55.8% −11.3% −4.08
MIA 40 47.5% 58.6% −11.1% −4.45
BAL 45 42.2% 52.2% −10.0% −4.48
TEX 51 41.2% 51.5% −10.3% −5.26

Our model - which we should caveat is very simple - predicted the Diamondbacks to overturn only 54% of their challenges. They instead got 67%, a gap of about 13 points, or roughly 6 calls over expected, which depending on the game state could be very impactful. (Arizona finished under .500, so it didn’t help them that much in the end.) The Phillies edged them for the top spot: an even larger rate gap over fewer challenges left Philadelphia with a slightly higher calls-over-expected tally. That said, given the sample sizes, it’s hard to read too much into this. Let’s also note that the Dodgers are middle of the pack, both in raw and predicted terms, yet are on their way to another World Series win.

Having additional information, such as the score, which umpire made the call, or whether the team was challenging on offense or defense, would add a lot of value to the model.
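For concreteness, the calls-over-expected arithmetic for Arizona works out like this, using the rounded rates from the table above:

```r
# Calls over expected = (actual rate - predicted rate) * number of challenges
az_actual <- 0.667  # Arizona's actual overturn rate
az_pred   <- 0.540  # model's predicted rate
az_n      <- 48     # challenges

calls_over_expected <- (az_actual - az_pred) * az_n
calls_over_expected  # ~6.1 extra overturned calls
```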

It’s always helpful to investigate the data a little bit more, so let’s look at the Tigers.

Show the code
team_replay_data_with_pred %>%
  filter(challenging_team == 'DET') %>%
  group_by(type) %>%
  summarise(n = n(),
            overturned_rate = mean(is_over_turned),
            pred = mean(pred)) %>%
  mutate(ot_oe = overturned_rate - pred,
         calls_over_expected = ot_oe * n) %>%
  arrange(desc(ot_oe)) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(columns = c('overturned_rate', 'pred', 'ot_oe'), decimals = 1) %>%
  fmt_number(columns = c(calls_over_expected), decimals = 2)
type n overturned_rate pred ot_oe calls_over_expected
Close play at 1st 9 77.8% 67.6% 10.2% 0.92
Home-plate collision 1 0.0% 0.0% −0.0% 0.00
Force play 2 50.0% 63.6% −13.6% −0.27
Catcher interference 3 66.7% 80.5% −13.8% −0.42
Fair/foul in outfield 3 33.3% 49.9% −16.5% −0.50
Catch/drop in outfield 4 25.0% 42.6% −17.6% −0.70
Tag play 14 28.6% 50.8% −22.2% −3.11

The Tigers did well on close plays at first, but were pretty bad at tag plays. The close play at first success is the only thing keeping them from being completely abysmal.
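Even here, sample size deserves a hedge. A quick exact binomial test on Detroit’s tag plays (4 of 14 overturned, against an expected 50.8%) shows how little 14 attempts can prove:

```r
# Is Detroit's tag-play shortfall (4 of 14 vs. an expected 50.8%)
# distinguishable from bad luck? Exact binomial test:
det_tag <- binom.test(4, 14, p = 0.508)
det_tag$p.value
```

With only 14 tag plays, even a 22-point shortfall isn’t statistically distinguishable from noise, which is one more reason not to treat these team-level splits as skill.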

Wrapping up, there are a few things we can note about this. First, having a more robust dataset would help us better identify teams that do well with replay. We can look at Statcast to watch video of every challenge, which would be worthwhile if we really want to dig deep.

Second, this dataset only includes calls that were challenged, not calls that should have been challenged. For instance, a runner may have been out at second but was called safe, and the defensive team didn’t challenge. That should count as a debit on their replay ability, but since we don’t have a way to track these situations, we lose sight of them.

Finally, the biggest thing to add would be win probability change. Getting a successful challenge during a game that’s 2-2 in the 8th is very different from getting the same challenge when you’re up 10-0 in the 8th. Might make for a fun phase 2 of this at some point.
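If we did have a win-probability column, the weighting could look something like this. The wpa values below are made-up toy numbers, since Savant’s replay table doesn’t provide win probability:

```r
# Sketch of the phase-2 idea: credit each successful challenge with the
# win-probability swing it produced. The wpa column is hypothetical toy
# data; the Savant replay table does not include win probability.
toy_replays <- data.frame(
  challenging_team = c("DET", "DET", "TEX"),
  is_over_turned   = c(1, 0, 1),
  wpa              = c(0.21, 0.00, 0.03)
)

# Total win probability gained via overturned calls, per team (base R)
wp_gained <- tapply(toy_replays$is_over_turned * toy_replays$wpa,
                    toy_replays$challenging_team, sum)
wp_gained
```

Under this weighting, a team that wins one huge late-game challenge could outrank a team that wins several trivial ones, which is closer to what we actually care about.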