Home Run Types – Point Estimates

As of the start of the day on Monday, June 23, 2,574 home runs had been hit in the 2025 MLB season. Some are line drives, some are moonshots. Some players always manage to just get them out; others leave no doubt.

We can classify home run types using data from Baseball Savant, which offers hit-level data so we can see the distance, launch angle, and launch speed for each home run hit. We’ll use k-means clustering, an unsupervised algorithm, to set our types. Before we fit the model, let’s first check the distribution of each of those three stats.

library(GGally)
library(gt)
library(gtExtras)
library(hrbrthemes)
library(tidymodels)
library(tidyverse)

set.seed(2025)

savant_data <- read_csv('savant_data.csv')

savant_data %>%
  ggpairs(columns = c('launch_angle', 'launch_speed', 'hit_distance_sc'))+
  theme_ipsum()

These each look roughly normal - hit distance skews a little right, launch angle a little left, but nothing super drastic. Launch angle isn’t as strongly related to speed and distance, but it’s aesthetically important so we’ll keep it.
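
To put a number on "a little right" and "a little left", we can compute sample skewness. Here is a minimal base-R sketch. The data are a synthetic stand-in rather than the Savant file, and `skewness()` is a helper defined here for illustration, not a function from the packages loaded above.

```r
# Synthetic stand-in for one column (NOT the Savant data): a right-skewed sample
set.seed(1)
hit_distance <- 350 + rgamma(500, shape = 4, scale = 12)

# Moment-based sample skewness: the mean of the cubed z-scores
skewness <- function(x) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
  mean(z^3)
}

skew_right <- skewness(hit_distance)   # positive: longer right tail
skew_left  <- skewness(-hit_distance)  # mirrored sample: same size, opposite sign
```

Values near zero suggest a roughly symmetric distribution; the sign tells you which tail is longer.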

We don’t know how many clusters there are. We could set cutoffs by hand, but we’d rather give the model only these three stats and see what comes back. In a real-world setting, an exercise like this won’t necessarily produce the final answer, but it gives us a nice baseline to work with and tweak.

We’ll follow the k-means example from the tidymodels website to figure out how many clusters we want: fit the model for k = 1 through 9 and compare the total within-cluster sum of squares.

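# Keep player names next to the three features; drop rows with missing values
savant_prepped_with_names <- savant_data %>%
  select(player_name, launch_speed, launch_angle, hit_distance_sc) %>%
  na.exclude()

# kmeans() needs numeric columns only
savant_prepped <- savant_prepped_with_names %>%
  select(-player_name)

# Fit k-means for k = 1 to 9, then summarise each fit per cluster (tidy),
# per model (glance), and per row (augment)
kclusts <- 
  tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~kmeans(savant_prepped, .x)),
    tidied = map(kclust, tidy),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, savant_prepped)
  )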

clusters <- 
  kclusts %>%
  unnest(cols = c(tidied))

assignments <- 
  kclusts %>% 
  unnest(cols = c(augmented))

clusterings <- 
  kclusts %>%
  unnest(cols = c(glanced))

ggplot(clusterings, aes(k, tot.withinss)) +
  geom_line() +
  geom_point()+
  theme_ipsum()

There are valid arguments to be made for 3 and 4 here. The drop in total within-cluster sum of squares flattens somewhat after 3, so we’ll go with that. Now let’s see what we can learn about our groupings.
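
If eyeballing the elbow feels too subjective, one option is to look at how much each extra cluster shrinks the total within-cluster sum of squares. The sketch below uses base R on synthetic data with three built-in groups, not the Savant file. `toy`, `group_means`, `wss`, and `pct_drop` are names made up for this example, and `nstart = 10` is an added choice that the code above does not use.

```r
set.seed(2025)

# Synthetic stand-in for the three features (NOT Savant data):
# three well-separated groups of 100 points each
group_means <- rbind(c(0, 0, 0), c(10, 10, 10), c(20, 0, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  sweep(matrix(rnorm(300), ncol = 3), 2, group_means[i, ], "+")
}))

# Total within-cluster sum of squares for k = 1..9
wss <- sapply(1:9, function(k) {
  kmeans(toy, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# Share of the remaining within-cluster SS removed by each extra cluster
pct_drop <- -diff(wss) / head(wss, -1)
elbow <- data.frame(k = 2:9, pct_drop = round(100 * pct_drop, 1))
print(elbow)
```

A large drop followed by much smaller ones points at the elbow. On real data the pattern is rarely this clean, which is why 3 versus 4 stays a judgment call.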

assignments %>%
  filter(k == 3) %>%
  ggpairs(columns = c('launch_angle', 'launch_speed', 'hit_distance_sc'),
          aes(color = as.factor(.cluster), alpha = 0.3))+
  theme_ipsum()+
  scale_color_manual(values = c('#9B4F4F', '#5E7A8A', '#6E7D5A'))

assignments %>%
  filter(k == 3) %>%
  group_by(.cluster) %>%
  summarise(launch_speed = mean(launch_speed),
            launch_angle = mean(launch_angle),
            hit_distance_sc = mean(hit_distance_sc)) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_number(columns = 2:4, decimals = 1)

.cluster	launch_speed	launch_angle	hit_distance_sc
1	104.7	28.5	394.8
2	101.4	30.2	364.0
3	107.8	27.9	424.0
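
The cluster means in this table differ far more in distance than in speed or angle. One likely reason is that `kmeans()` works on raw Euclidean distance, so a feature measured on a wider numeric scale carries more weight. The sketch below illustrates this on synthetic data; the means and standard deviations are assumed values for illustration, not estimates from the Savant file. It also shows `scale()` as one common way to put the features on equal footing, which is an alternative to the approach taken here, not what the code above does.

```r
set.seed(2025)
n <- 300

# Synthetic home runs (NOT Savant data); the spreads are assumptions
toy <- data.frame(
  launch_speed    = rnorm(n, mean = 105, sd = 4),
  launch_angle    = rnorm(n, mean = 29,  sd = 4),
  hit_distance_sc = rnorm(n, mean = 400, sd = 25)
)

# Share of total variance carried by each raw feature
raw_var <- sapply(toy, var)
raw_var_share <- raw_var / sum(raw_var)

# Standardize each column to mean 0, sd 1 so no feature dominates by units alone
toy_scaled <- scale(toy)
scaled_var <- apply(toy_scaled, 2, var)
scaled_var_share <- scaled_var / sum(scaled_var)

# Fit both versions and compare how far apart the centers sit on each feature
fit_raw    <- kmeans(toy, centers = 3, nstart = 10)
fit_scaled <- kmeans(toy_scaled, centers = 3, nstart = 10)
spread_raw    <- apply(fit_raw$centers, 2, function(col) diff(range(col)))
spread_scaled <- apply(fit_scaled$centers, 2, function(col) diff(range(col)))
```

Whether to scale is a modeling choice. Leaving the features raw effectively says distance matters most; scaling says all three matter equally.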

Note that our algorithm really focused in on distance. That is likely in part because we didn’t scale the features first: distance in feet varies over a much wider numeric range than speed in mph or angle in degrees, so it carries the most weight in the distances k-means minimizes. Cluster 2 holds the shortest home runs with the highest launch angle, cluster 3 the longest with the highest speed, and cluster 1 is middle of the pack. We’ll call cluster 3 lasers, 2 moonshots, and 1 typicals.

Now that that’s settled, who hits the most of each kind? They all count, so it’s more trivia than anything, but fun to know. We’ll filter to players with at least 15 home runs.

assignments %>%
  filter(k == 3) %>%
  select(-k, -kclust, -tidied, -glanced) %>%
  # Rows are in the same order as savant_prepped_with_names, so the names line up
  cbind(savant_prepped_with_names %>% select(player_name)) %>%
  group_by(player_name) %>%
  # Cluster numbers follow the summary table above:
  # 1 = typicals, 2 = moonshots, 3 = lasers
  summarise(HR = n(),
            typicals = sum(.cluster == 1),
            moonshots = sum(.cluster == 2),
            lasers = sum(.cluster == 3),
            avg_launch_speed = mean(launch_speed),
            avg_launch_angle = mean(launch_angle),
            avg_hit_distance_sc = mean(hit_distance_sc)) %>%
  arrange(desc(HR)) %>%
  filter(HR >= 15) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_number(columns = 6:8, decimals = 1)

player_name	HR	typicals	moonshots	lasers	avg_launch_speed	avg_launch_angle	avg_hit_distance_sc
Raleigh, Cal	31	8	15	8	105.0	32.3	383.6
Judge, Aaron	27	8	7	12	107.6	30.9	405.2
Ohtani, Shohei	26	12	4	10	109.0	29.6	401.3
Suárez, Eugenio	25	8	2	15	105.2	31.6	416.7
Schwarber, Kyle	24	8	2	14	107.6	28.8	414.5
Crow-Armstrong, Pete	21	8	8	5	105.7	31.6	390.9
Wood, James	21	7	2	12	110.1	27.2	415.0
Carroll, Corbin	20	7	3	10	105.5	28.1	403.6
Suzuki, Seiya	20	11	4	5	106.5	26.6	399.5
Caminero, Junior	19	8	7	4	104.7	26.8	389.4
Ward, Taylor	19	7	5	7	103.5	30.1	392.5
Alonso, Pete	18	5	3	10	107.3	27.6	409.2
Buxton, Byron	17	8	2	7	106.0	30.4	407.9
De La Cruz, Elly	17	6	3	8	108.5	27.2	408.0
Greene, Riley	17	6	6	5	107.9	30.1	394.9
O'Hoppe, Logan	17	9	3	5	104.8	31.1	401.5
Adell, Jo	16	6	1	9	107.6	28.1	411.9
Devers, Rafael	16	7	5	4	106.4	27.8	396.0
Lindor, Francisco	16	9	4	3	103.8	30.3	392.2
Muncy, Max	16	10	0	6	106.7	29.4	406.3
Pages, Andy	16	10	3	3	104.1	28.8	394.6
Paredes, Isaac	16	8	8	0	101.0	31.2	380.9
Soto, Juan	16	10	2	4	107.4	29.8	399.2
Torkelson, Spencer	16	9	3	4	105.2	28.8	397.5
Grisham, Trent	15	7	5	3	102.1	30.8	386.0
Lowe, Brandon	15	11	1	3	105.4	29.3	399.9
Nimmo, Brandon	15	7	4	4	103.6	30.0	392.4
Olson, Matt	15	4	4	7	106.4	28.0	396.7
Rooker, Brent	15	8	2	5	105.1	27.8	400.2
Tucker, Kyle	15	10	2	3	103.4	29.7	395.5
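
One caveat with hard-coding cluster numbers: the ids that `kmeans()` returns are arbitrary, so cluster 1 in one run can come back as cluster 3 in another. A possible safeguard is to attach the names by rank on mean distance instead of by raw id. The sketch below does that in base R on synthetic data (not the Savant file); `toy`, `by_distance`, `type_names`, and `hr_type` are illustrative names.

```r
set.seed(2025)

# Synthetic data (NOT Savant) with three distance groups of 50 home runs each
toy <- data.frame(
  launch_speed    = c(rnorm(50, 101, 2), rnorm(50, 105, 2), rnorm(50, 108, 2)),
  launch_angle    = c(rnorm(50, 30, 3),  rnorm(50, 28, 3),  rnorm(50, 28, 3)),
  hit_distance_sc = c(rnorm(50, 365, 5), rnorm(50, 395, 5), rnorm(50, 425, 5))
)

fit <- kmeans(toy, centers = 3, nstart = 10)

# Cluster ids ordered from shortest to longest mean distance
by_distance <- order(fit$centers[, "hit_distance_sc"])

# Names attached by distance rank, not by raw cluster id
type_names <- c("moonshots", "typicals", "lasers")
hr_type <- factor(type_names[match(fit$cluster, by_distance)],
                  levels = type_names)

# Mean distance per named type; increases from moonshots to lasers
mean_distance <- tapply(toy$hit_distance_sc, hr_type, mean)
```

With this mapping, the shortest group is always the moonshots and the longest always the lasers, whatever ids a given run hands back.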

Cal Raleigh not only leads baseball in home runs, but he’s also dominating the moonshots category with 15, almost twice as many as runners-up Pete Crow-Armstrong and Isaac Paredes (8 each). Eugenio Suárez has a league-leading 15 lasers, while Shohei Ohtani leads in typicals with 12 - which instantly makes me suspect my naming convention, since nothing Ohtani does is typical. And if we look at his other categories, we see that he tends to lift the ball more on his home runs, eschewing the lower launch angles of his merely mortal peers.

So there’s clustering - a quick, relatively painless way to get you started categorizing your data.