Home Run Types

statistics
r
baseball
Author

Mark Jurries II

Published

June 24, 2025

As of the start of the day on Monday, June 23, 2,574 home runs had been hit in the 2025 MLB season. Some are line drives, some are moonshots. Some players always manage to just get them out; others leave no doubt.

We can classify home run types using data from Baseball Savant, which offers hit-level data so we can see the distance, launch angle, and launch speed for each home run hit. We’ll use k-means clustering, an unsupervised algorithm, to set our types. Before we start training our model to classify types, let’s first check each of the aforementioned stats to see their distributions.

library(GGally)
library(gt)
library(gtExtras)
library(hrbrthemes)
library(tidymodels)
library(tidyverse)

set.seed(2025)

savant_data <- read_csv('savant_data.csv')

savant_data %>%
  ggpairs(columns = c('launch_angle', 'launch_speed', 'hit_distance_sc'))+
  theme_ipsum()

These each look roughly normal - hit distance skews a little right, launch angle a little left, but nothing super drastic. Launch angle isn’t as strongly related to speed or distance as those two are to each other, but it’s aesthetically important so we’ll keep it.

We don’t know how many clusters there are. We could set the boundaries by hand, but we’d rather hand the model only the data and see what comes back. In a real-world setting, a first pass like this won’t necessarily be the final answer, but it gives us a nice baseline to work with and tweak.

We’ll use the example from the Tidymodels website to figure out how many clusters we want.

savant_prepped_with_names <- savant_data %>%
  select(player_name, launch_speed, launch_angle, hit_distance_sc) %>%
  na.exclude()

savant_prepped <- savant_prepped_with_names %>%
  select(-player_name)

kclusts <- 
  tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~kmeans(savant_prepped, .x)) ,
    tidied = map(kclust, tidy),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, savant_prepped)
  )

clusters <- 
  kclusts %>%
  unnest(cols = c(tidied))

assignments <- 
  kclusts %>% 
  unnest(cols = c(augmented))

clusterings <- 
  kclusts %>%
  unnest(cols = c(glanced))

ggplot(clusterings, aes(k, tot.withinss)) +
  geom_line() +
  geom_point()+
  theme_ipsum()

There are valid arguments to be made for 3 and 4 here. The returns diminish somewhat after 3, so we’ll go with that. Now let’s see what we can learn about our groupings.
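
To make “diminishing returns” a bit more concrete, one option is to look at how much each extra cluster shrinks the total within-cluster sum of squares. The sketch below is only an illustration: `sim` is simulated data standing in for the home run table (its means and spreads are made up), and `nstart = 10` is an assumption added here so the fits depend less on the random starting centers.

```r
set.seed(2025)

# Simulated stand-in for the home run data; these numbers are made up
# for illustration and are not taken from Baseball Savant.
n <- 300
sim <- data.frame(
  launch_speed    = rnorm(n, mean = 105, sd = 4),
  launch_angle    = rnorm(n, mean = 29,  sd = 5),
  hit_distance_sc = rnorm(n, mean = 400, sd = 25)
)

# Fit k-means for k = 1..9 and keep the total within-cluster sum of squares.
ks <- 1:9
tot_withinss <- sapply(ks, function(k) {
  kmeans(sim, centers = k, nstart = 10)$tot.withinss
})

# Share of the previous value removed by adding one more cluster.
pct_drop <- -diff(tot_withinss) / head(tot_withinss, -1)

elbow <- data.frame(k = ks[-1],
                    tot_withinss = tot_withinss[-1],
                    pct_drop = round(pct_drop, 3))
print(elbow)
```

A large `pct_drop` followed by a run of small ones is the numeric version of the elbow in the plot. It is still a judgment call, just one with numbers attached.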

assignments %>%
  filter(k == 3) %>%
  ggpairs(columns = c('launch_angle', 'launch_speed', 'hit_distance_sc'),
          aes(color = as.factor(.cluster), alpha = 0.3))+
  theme_ipsum()+
  scale_color_manual(values = c('#9B4F4F', '#5E7A8A', '#6E7D5A'))

assignments %>%
  filter(k == 3) %>%
  group_by(.cluster) %>%
  summarise(launch_speed = mean(launch_speed),
            launch_angle = mean(launch_angle),
            hit_distance_sc = mean(hit_distance_sc)) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_number(columns = 2:4, decimals = 1)
.cluster launch_speed launch_angle hit_distance_sc
1 104.7 28.5 394.8
2 101.4 30.2 364.0
3 107.8 27.9 424.0
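
The cluster means differ far more in distance than in speed or angle. One likely reason (an assumption on our part, not something checked against the real data) is that k-means uses Euclidean distance on the raw columns, so a column measured in feet with a spread in the tens can outweigh columns measured in mph or degrees. The sketch below, on simulated data with made-up means and spreads, compares clustering the raw columns with clustering columns standardized by `scale()`.

```r
set.seed(2025)

# Simulated stand-in data; the means and spreads are made up for illustration.
n <- 300
sim <- data.frame(
  launch_speed    = rnorm(n, mean = 105, sd = 4),
  launch_angle    = rnorm(n, mean = 29,  sd = 5),
  hit_distance_sc = rnorm(n, mean = 400, sd = 25)
)

# Share of total variance contributed by each raw column: the column with
# the largest spread dominates the distances k-means works with.
raw_share <- sapply(sim, var) / sum(sapply(sim, var))
print(round(raw_share, 2))

# scale() centers each column to mean 0 and rescales it to sd 1,
# so every column carries equal weight.
sim_scaled <- scale(sim)

fit_raw    <- kmeans(sim,        centers = 3, nstart = 10)
fit_scaled <- kmeans(sim_scaled, centers = 3, nstart = 10)

# Cross-tabulate the two assignments to see how much the groupings change.
print(table(raw = fit_raw$cluster, scaled = fit_scaled$cluster))
```

Whether standardizing is the right call depends on the question. If distance really is what should separate the types, the raw columns may be fine as they are.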

Note that our algorithm really focused in on distance. Cluster 2 holds the shortest home runs, with the highest launch angle; cluster 3 holds the longest, with the highest speed; and cluster 1 sits in the middle of the pack. We’ll call cluster 3 lasers, cluster 2 moonshots, and cluster 1 typicals.

Now that that’s settled, who hits the most of each kind? They all count, so it’s more trivia than anything, but fun to know. We’ll keep only players with at least 15 home runs.

assignments %>%
  filter(k == 3) %>%
  select(-k, -kclust, -tidied, -glanced) %>%
  # Rows are still in the same order as savant_prepped_with_names,
  # so the player names can be bound back on column-wise
  cbind(savant_prepped_with_names %>% select(player_name)) %>%
  group_by(player_name) %>%
  # Cluster numbers follow the summary table above:
  # 1 = typicals, 2 = moonshots, 3 = lasers
  summarise(HR = n(),
            typicals = sum(.cluster == 1),
            moonshots = sum(.cluster == 2),
            lasers = sum(.cluster == 3),
            avg_launch_speed = mean(launch_speed),
            avg_launch_angle = mean(launch_angle),
            avg_hit_distance_sc = mean(hit_distance_sc)) %>%
  arrange(desc(HR)) %>%
  filter(HR >= 15) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_number(columns = 6:8, decimals = 1)
player_name HR typicals moonshots lasers avg_launch_speed avg_launch_angle avg_hit_distance_sc
Raleigh, Cal 31 8 15 8 105.0 32.3 383.6
Judge, Aaron 27 8 7 12 107.6 30.9 405.2
Ohtani, Shohei 26 12 4 10 109.0 29.6 401.3
Suárez, Eugenio 25 8 2 15 105.2 31.6 416.7
Schwarber, Kyle 24 8 2 14 107.6 28.8 414.5
Crow-Armstrong, Pete 21 8 8 5 105.7 31.6 390.9
Wood, James 21 7 2 12 110.1 27.2 415.0
Carroll, Corbin 20 7 3 10 105.5 28.1 403.6
Suzuki, Seiya 20 11 4 5 106.5 26.6 399.5
Caminero, Junior 19 8 7 4 104.7 26.8 389.4
Ward, Taylor 19 7 5 7 103.5 30.1 392.5
Alonso, Pete 18 5 3 10 107.3 27.6 409.2
Buxton, Byron 17 8 2 7 106.0 30.4 407.9
De La Cruz, Elly 17 6 3 8 108.5 27.2 408.0
Greene, Riley 17 6 6 5 107.9 30.1 394.9
O'Hoppe, Logan 17 9 3 5 104.8 31.1 401.5
Adell, Jo 16 6 1 9 107.6 28.1 411.9
Devers, Rafael 16 7 5 4 106.4 27.8 396.0
Lindor, Francisco 16 9 4 3 103.8 30.3 392.2
Muncy, Max 16 10 0 6 106.7 29.4 406.3
Pages, Andy 16 10 3 3 104.1 28.8 394.6
Paredes, Isaac 16 8 8 0 101.0 31.2 380.9
Soto, Juan 16 10 2 4 107.4 29.8 399.2
Torkelson, Spencer 16 9 3 4 105.2 28.8 397.5
Grisham, Trent 15 7 5 3 102.1 30.8 386.0
Lowe, Brandon 15 11 1 3 105.4 29.3 399.9
Nimmo, Brandon 15 7 4 4 103.6 30.0 392.4
Olson, Matt 15 4 4 7 106.4 28.0 396.7
Rooker, Brent 15 8 2 5 105.1 27.8 400.2
Tucker, Kyle 15 10 2 3 103.4 29.7 395.5

Cal Raleigh not only leads baseball in home runs, but he’s dominating the moonshots category with 15, almost twice as many as runners-up Pete Crow-Armstrong and Isaac Paredes (8 each). Suárez has a league-leading 15 lasers, while Shohei Ohtani leads in typicals with 12 - which instantly makes me suspect my naming convention, since nothing Ohtani does is typical. And if we look at his other categories, we see that he tends to lift the ball more on his home runs, eschewing the lower launch angles of his merely mortal peers.
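
One caveat about the labels: k-means numbers its clusters arbitrarily, so the cluster that comes back as number 3 in one run can come back as number 1 in another. A possible safeguard, sketched here on simulated data (the numbers and the `type` column are made up for illustration), is to name clusters by the rank of their mean distance rather than by their number.

```r
set.seed(2025)

# Simulated stand-in data; made-up numbers, not Baseball Savant values.
n <- 300
sim <- data.frame(
  launch_speed    = rnorm(n, mean = 105, sd = 4),
  launch_angle    = rnorm(n, mean = 29,  sd = 5),
  hit_distance_sc = rnorm(n, mean = 400, sd = 25)
)

fit <- kmeans(sim, centers = 3, nstart = 10)

# Rank the clusters by mean distance: 1 = shortest, 3 = longest.
dist_rank <- rank(fit$centers[, "hit_distance_sc"])

# Attach names by rank, so the labels no longer depend on cluster numbers.
type_names <- c("moonshots", "typicals", "lasers")
sim$type <- type_names[dist_rank[fit$cluster]]

print(aggregate(hit_distance_sc ~ type, data = sim, FUN = mean))
```

With labels assigned this way, the shortest cluster is always the moonshots and the longest is always the lasers, whatever numbers k-means happens to hand out.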

So there’s clustering - a quick, relatively painless way to get you started categorizing your data.