Run Scoring and Switching Leagues

statistics
r
baseball
Author

Mark Jurries II

Published

August 26, 2025

The designated hitter (DH) was introduced in the American League in 1973. Instead of pitchers flailing away at the plate, a higher-quality hitter would bat in their place. Within a few years, the AL was scoring more runs per game than the NL, since it wasn’t giving as many at-bats to poor hitters.

library(hrbrthemes)
library(gt)
library(gtExtras)
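library(Lahman)
library(lubridate)  # make_date(); attached by tidyverse >= 2.0.0, loaded explicitly for older versions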
library(MarketMatching)
library(tidyverse)

l_teams <- Lahman::Teams %>%
  as_tibble() %>%
  mutate(franchID = as.character(franchID))

l_teams %>%
  filter(yearID >= 1920) %>%
  filter(lgID %in% c('AL', 'NL')) %>%
  group_by(yearID, lgID) %>%
  summarise(G = sum(G),
            R = sum(R),
            .groups = 'drop') %>%  # drop grouping to avoid the regrouping message
  mutate(runs_per_game = R / G) %>%  # league-wide runs per team-game
  ggplot(aes(x = yearID, y = runs_per_game, color = lgID))+
  geom_line()+
  scale_color_manual(values = c('#EE0A46', '#0E4082'))+
  theme_ipsum()+
  geom_vline(xintercept = 1973, linetype = 'dashed')

In this timeframe, the Brewers moved from the AL to the NL (in 1998), losing their DH. Later, in 2013, the Astros moved from the NL to the AL, gaining one. These moves are natural experiments we can use to see how switching leagues affected run scoring. First, let’s look at runs scored by year by team, coloring by league.

l_teams %>%
  filter(yearID >= 1920) %>%
  filter(lgID %in% c('AL', 'NL')) %>%
  ggplot(aes(x = yearID, y = R, color = lgID))+
  geom_line()+
  facet_wrap(franchID ~ .)+
  scale_color_manual(values = c('#EE0A46', '#0E4082'))+
  theme_ipsum()

It’s hard to see if there’s a difference just by eyeballing the trendlines. Instead, we’ll bring a bit more rigor with some causal inference. This technique looks at data prior to a change and forecasts what would have been expected if things had continued as they were. To help with forecast accuracy, we’ll use the MarketMatching package. This lets us find teams with similar run scoring patterns prior to the change and use their post-change info to arrive at a reasonable forecast.
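To make the idea concrete, here is a minimal sketch of a counterfactual forecast on simulated data. It is not what MarketMatching does internally (that uses a Bayesian structural time series model); it swaps in a plain linear regression, and the team names, the 50-run drop, and the noise levels are all made up for illustration.

```r
# Toy counterfactual forecast: simulated data, not Lahman.
set.seed(42)

years <- 1969:2007
n <- length(years)
pre <- years <= 1997            # hypothetical pre-change window

# Two hypothetical control teams sharing a league-wide run environment
environment_trend <- 700 + cumsum(rnorm(n, 0, 15))
control_a <- environment_trend + rnorm(n, 0, 20)
control_b <- environment_trend + rnorm(n, 0, 20)

# Test team tracks the controls, then loses 50 runs per year after the change
test_team <- 0.5 * control_a + 0.5 * control_b + rnorm(n, 0, 10)
test_team[!pre] <- test_team[!pre] - 50

toy <- data.frame(years, test_team, control_a, control_b)

# Fit on the pre-period only, then predict the post-period from the controls
fit <- lm(test_team ~ control_a + control_b, data = toy[pre, ])
expected <- predict(fit, newdata = toy[!pre, ])

# Estimated effect: actual minus expected, averaged over the post-period
effect <- mean(toy$test_team[!pre] - expected)
effect
```

Because the controls keep playing under the old conditions, their post-change seasons tell us what the test team would likely have done without the change, and the gap between actual and expected is the estimated effect.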

There are a few things we need to be mindful of here. Firstly, MarketMatching will look at all teams, even expansion teams that don’t have history in our prior period. Thus, when we look at what happened to the Brewers, we’ll exclude teams that weren’t around when they started in 1969.

Secondly, MarketMatching defaults to high reliance on dynamic time warping. This is very useful when comparing regions in different time zones and/or looking at daily/weekly data, but since we’re looking at annual numbers, we’ll dial it down to 0.1.
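As a rough illustration of what that dial does, the sketch below assumes the ranking is a weighted blend of a distance rank and a correlation rank. That blend is a simplification for illustration, not a copy of the package internals, and the candidate teams and numbers are invented.

```r
# Hypothetical candidates with invented DTW distances and correlations
candidates <- data.frame(
  team         = c("AAA", "BBB", "CCC"),
  dtw_distance = c(0.10, 0.20, 0.30),   # smaller is more similar
  correlation  = c(0.50, 0.90, 0.70)    # larger is more similar
)

# Assumed blending rule: weight the two ranks by dtw_emphasis
blend_ranks <- function(df, dtw_emphasis) {
  dist_rank <- rank(df$dtw_distance)      # 1 = closest by DTW
  corr_rank <- rank(-df$correlation)      # 1 = highest correlation
  df$combined_rank <- dtw_emphasis * dist_rank + (1 - dtw_emphasis) * corr_rank
  df[order(df$combined_rank), ]
}

best_dtw_heavy  <- blend_ranks(candidates, dtw_emphasis = 1)$team[1]
best_corr_heavy <- blend_ranks(candidates, dtw_emphasis = 0.1)$team[1]
```

With full emphasis on warping, the closest team by distance wins. At 0.1, the team whose yearly totals correlate best with the test team rises to the top, which is the behavior we want for annual data.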

We’re looking at the Brewers first. They were in the AL from 1969 to 1997, so we’ll use those seasons to find similar teams. Let’s first see what teams it uses for matching.

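# Keep the matching years plus the post-move seasons, and only franchises
# that already existed in 1969 so every control has a full pre-period
l_mm_mil <- l_teams %>%
  filter(yearID >= 1969 & yearID <= 2008) %>%
  mutate(year_date = make_date(year = yearID, month = 1, day = 1)) %>%
  group_by(franchID) %>%
  mutate(first_year = min(yearID)) %>%
  ungroup() %>%
  filter(first_year <= 1969)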

mm_mil <- MarketMatching::best_matches(data = l_mm_mil, 
                                   id = "franchID",
                                   date_variable = "year_date",
                                   matching_variable = "R",
                                   parallel = FALSE,
                                   markets_to_be_matched = "MIL",
                                   warping_limit = 1,
                                   dtw_emphasis = 0.1,
                                   matches = 5,
                                   start_match_period = "1969-01-01",
                                   end_match_period = "1997-01-01")

head(mm_mil$BestMatches) %>%
  select(BestControl, Correlation) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_number(columns = c('Correlation'), decimals = 3)
BestControl Correlation
ANA 0.748
WSN 0.730
KCR 0.675
TEX 0.667
CHW 0.656

This is good - it finds that the teams with the most similar annual run scoring to the Brewers from 1969 - 1997 were primarily AL teams. We can now build our counterfactual forecast and compare it to the actual totals.

results_mil <- MarketMatching::inference(matched_markets = mm_mil, 
                                     test_market = "MIL", 
                                     analyze_betas = TRUE,
                                     end_post_period = "2007-01-01", 
                                     prior_level_sd = 0.1)
    ------------- Inputs -------------
    Market ID: franchID
    Date Variable: year_date
    Matching Metric: R

    Test Market: MIL
    Control Market 1: ANA
    Control Market 2: CHW
    Control Market 3: KCR
    Control Market 4: TEX
    Control Market 5: WSN

    Matching (pre) Period Start Date: 1969-01-01
    Matching (pre) Period End Date: 1997-01-01
    Post Period Start Date: 1998-01-01
    Post Period End Date: 2007-01-01

    bsts parameters: 
      prior.level.sd: 0.1
      No seasonality component (controlled for by the matched markets) 
    Posterior Intervals Tail Area: 95%

    ------------- Model Stats -------------
    Matching (pre) Period MAPE: 6.68%
    Beta 1 [ANA]: 0.4585
    Beta 2 [CHW]: 0.2618
    Beta 3 [KCR]: 0.3842
    Beta 4 [TEX]: 0.2942
    Beta 5 [WSN]: 0.5305
    DW: 2.22

    ------------- Effect Analysis -------------
    Absolute Effect: -553.08 [-1335.62, 209.9]
    Relative Effect: -6.84% [-15.59%, 2.99%]
    Probability of a causal impact: 90.4618%

Over the matching period, our model is off by an average of about 6.7% per year from what the Brewers actually did - not fantastic, but not bad, either. It’s helpful to visualize this as well, so let’s take that step.
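For reference, MAPE (mean absolute percentage error) is usually defined as the average of the absolute misses, each taken as a share of the actual value. The sketch below uses that usual definition with invented run totals; the package’s exact calculation may differ in detail.

```r
# Invented actual and fitted run totals for five pre-period seasons
actual <- c(700, 750, 800, 650, 720)
fitted <- c(735, 720, 760, 689, 720)

# Mean absolute percentage error, expressed in percent
mape <- function(actual, fitted) {
  mean(abs(actual - fitted) / actual) * 100
}

mape(actual, fitted)
```

A miss of 35 runs on a 700-run season counts as 5%, and the final number is just the average of those yearly percentages.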

results_mil$PlotActualVersusExpected+
  theme_ipsum()+
  scale_color_manual(values = c('darkgrey', 'blue'))

There’s a dropoff in run scoring, but the total is still within the expected range. The CausalImpact package, which MarketMatching uses under the hood, has a handy, if verbose, report, which we can reference for more detail.

results_mil$CausalImpactObject$report %>%
  as_tibble() %>%
  gt() %>%
  gt_theme_espn() %>%
  tab_options(column_labels.hidden = TRUE)
During the post-intervention period, the response variable had an average value of approx. 723.40. In the absence of an intervention, we would have expected an average response of 778.71. The 95% interval of this counterfactual prediction is [702.41, 856.96]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is -55.31 with a 95% interval of [-133.56, 20.99]. For a discussion of the significance of this effect, see below.

Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 7.23K. Had the intervention not taken place, we would have expected a sum of 7.79K. The 95% interval of this prediction is [7.02K, 8.57K].

The above results are given in terms of absolute numbers. In relative terms, the response variable showed a decrease of -7%. The 95% interval of this percentage is [-16%, +3%].

This means that, although it may look as though the intervention has exerted a negative effect on the response variable when considering the intervention period as a whole, this effect is not statistically significant, and so cannot be meaningfully interpreted. The apparent effect could be the result of random fluctuations that are unrelated to the intervention. This is often the case when the intervention period is very long and includes much of the time when the effect has already worn off. It can also be the case when the intervention period is too short to distinguish the signal from the noise. Finally, failing to find a significant effect can happen when there are not enough control variables or when these variables do not correlate well with the response variable during the learning period.

The probability of obtaining this effect by chance is p = 0.095. This means the effect may be spurious and would generally not be considered statistically significant.
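The per-season and cumulative figures above are the same estimate viewed two ways, tied together by the ten seasons in the post period (1998 through 2007). A quick arithmetic check, using only numbers already printed in the outputs above:

```r
post_seasons <- length(1998:2007)          # 10 seasons in the post period

avg_actual   <- 723.40                     # average runs per season, from the report
avg_expected <- 778.71                     # counterfactual average, from the report
cum_effect   <- -553.08                    # absolute effect, from the inference output

avg_effect     <- cum_effect / post_seasons   # per-season effect
total_actual   <- avg_actual * post_seasons   # matches the reported 7.23K
total_expected <- avg_expected * post_seasons # matches the reported 7.79K
```

So the estimate is that the Brewers scored about 55 fewer runs per season than the forecast, or roughly 550 over the decade.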

So there’s a bit of a downturn, but it may be just chance. This isn’t entirely surprising, since teams are like the Ship of Theseus - at some point there’s enough turnover that you have to ask if it’s the same team or not. The Yankees no longer have Babe Ruth - nor, in fact, any member of their 1927 squad - playing for them, so they are in a sense a completely different team. Changes within a few years may be less drastic, but can still be impactful.

Now that we’ve decided this approach may have some philosophical flaws, we may as well look at the Astros. We’ll be brief here.

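# Same setup as the Brewers: keep franchises that already existed when
# Houston started in 1962 so every control has a full pre-period
l_mm_hou <- l_teams %>%
  filter(yearID >= 1962 & yearID <= 2019) %>%
  mutate(year_date = make_date(year = yearID, month = 1, day = 1)) %>%
  group_by(franchID) %>%
  mutate(first_year = min(yearID)) %>%
  ungroup() %>%
  filter(first_year <= 1962)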


mm_hou <- MarketMatching::best_matches(data = l_mm_hou, 
                                   id = "franchID",
                                   date_variable = "year_date",
                                   matching_variable = "R",
                                   parallel = FALSE,
                                   markets_to_be_matched = "HOU",
                                   warping_limit = 1,
                                   dtw_emphasis = 0.1, 
                                   matches = 5, # request 5 matches
                                   start_match_period = "1962-01-01",
                                   end_match_period = "2012-01-01")

results_hou <- MarketMatching::inference(matched_markets = mm_hou, 
                                     test_market = "HOU", 
                                     analyze_betas = TRUE,
                                     end_post_period = "2019-01-01", 
                                     prior_level_sd = 0.1)
    ------------- Inputs -------------
    Market ID: franchID
    Date Variable: year_date
    Matching Metric: R

    Test Market: HOU
    Control Market 1: CHC
    Control Market 2: CHW
    Control Market 3: CLE
    Control Market 4: OAK
    Control Market 5: SFG

    Matching (pre) Period Start Date: 1962-01-01
    Matching (pre) Period End Date: 2012-01-01
    Post Period Start Date: 2013-01-01
    Post Period End Date: 2019-01-01

    bsts parameters: 
      prior.level.sd: 0.1
      No seasonality component (controlled for by the matched markets) 
    Posterior Intervals Tail Area: 95%

    ------------- Model Stats -------------
    Matching (pre) Period MAPE: 4.76%
    Beta 1 [CHC]: 0.4758
    Beta 2 [CHW]: 0.1653
    Beta 3 [CLE]: 0.2229
    Beta 4 [OAK]: 0.3016
    Beta 5 [SFG]: 0.1987
    DW: 1.6

    ------------- Effect Analysis -------------
    Absolute Effect: 518.33 [6.36, 993.44]
    Relative Effect: 11.03% [0.12%, 23.04%]
    Probability of a causal impact: 97.7107%

A MAPE of 4.8% is solid, though interestingly the control teams are more AL than NL.

results_hou$PlotActualVersusExpected+
  theme_ipsum()+
  scale_color_manual(values = c('darkgrey', 'blue'))

We don’t really see an increase until the mid-2010s. While this may be because of the advantage of the DH, it may also have to do with stealing signals and the infamous trash can scandal. Houston was also famously data-driven at this time, and it’s possible they simply optimized via good drafts and trades to become a better team.

This is one of those questions that can be answered many different ways - we could also look at wRC+ and/or Runs Created from the DH spot and compare to average pitcher production. Regardless, the game’s more fun with a professional hitter there, even if it means we no longer get to see Bartolo Colon hit home runs.