As of the start of play on Monday, June 23, 2,574 home runs had been hit in the 2025 MLB season. Some are line drives, some are moonshots. Some players always seem to just barely get them out; others leave no doubt.
We can classify home run types using hit-level data from Baseball Savant, which records the distance, launch angle, and launch speed of each home run. We’ll use k-means clustering, an unsupervised algorithm, to define our types. Before we fit the model, let’s check the distribution of each of these three stats.
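As a rough numeric companion to eyeballing histograms, here is a minimal Python sketch (standard library only) that summarizes one stat and reports its skewness. The distances are made-up toy values, not Baseball Savant data, and `describe` and `skewness` are hypothetical helper names.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def describe(values):
    """Return a few summary numbers for one stat."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "sd": statistics.pstdev(values),
        "skew": skewness(values),
    }

# Made-up distances in feet, for illustration only (not real data).
toy_distance = [372, 380, 385, 390, 395, 398, 401, 405, 410, 418, 430, 455]

summary = describe(toy_distance)
print({name: round(value, 2) for name, value in summary.items()})
```

A positive skew with the mean above the median points to a right tail, a negative skew to a left tail, and a value near zero to a roughly symmetric shape.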
These each look roughly normal - hit distance skews a little right, launch angle a little left, but nothing super drastic. Launch angle isn’t as strongly related to speed and distance, but it’s aesthetically important so we’ll keep it.
We don’t know how many clusters there are. We could set cutoffs by hand, but we’d rather give the model only these three stats and see what comes back. In a real-world setting, this kind of exploration won’t necessarily produce a final answer, but it gives us a nice baseline to work with and tweak.
We’ll use the example from the Tidymodels website to figure out how many clusters we want.
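The idea behind that kind of example is to fit k-means for a range of k and compare the total within-cluster sum of squares. The sketch below is not the Tidymodels code; it is a plain Python stand-in (standard library only) using a hand-rolled Lloyd's algorithm on randomly generated toy points. The blob centers, the `kmeans` helper, and the number of restarts are all assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels, total within-cluster SS)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # leave an empty cluster where it was
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, wss

# Toy data: three made-up blobs of (distance ft, launch angle deg, launch speed mph).
rng = random.Random(42)
blobs = [(385, 32, 100), (405, 28, 104), (435, 26, 109)]
points = [
    (rng.gauss(d, 6), rng.gauss(a, 3), rng.gauss(s, 2))
    for d, a, s in blobs
    for _ in range(40)
]

# For each k, keep the best of a few random restarts.
wss_by_k = {}
for k in range(1, 7):
    wss_by_k[k] = min(kmeans(points, k, seed=seed)[2] for seed in range(5))

for k, wss in wss_by_k.items():
    print(k, round(wss, 1))
```

Plotting these totals against k gives the familiar elbow curve: the k where the drop starts to flatten is a reasonable candidate.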
There are valid arguments to be made for 3 and 4 here. The returns diminish somewhat after 3, so we’ll go with that. Now let’s see what we can learn about our groupings.
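One simple way to learn about the groupings is to average each stat within each cluster. The Python sketch below assumes every home run already carries a cluster label; the rows are invented toy values and `cluster_means` is a hypothetical helper, so the printed numbers are not the post's results.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical rows: (cluster, distance_ft, launch_angle_deg, launch_speed_mph).
toy_rows = [
    (1, 438, 25, 110), (1, 445, 27, 109), (1, 430, 24, 111),
    (2, 378, 34, 99), (2, 385, 33, 100), (2, 372, 36, 98),
    (3, 405, 29, 104), (3, 410, 28, 105), (3, 400, 30, 103),
]

def cluster_means(rows):
    """Average each feature within each cluster label."""
    grouped = defaultdict(list)
    for cluster, *features in rows:
        grouped[cluster].append(features)
    return {
        cluster: tuple(fmean(column) for column in zip(*feats))
        for cluster, feats in sorted(grouped.items())
    }

means = cluster_means(toy_rows)
for cluster, (dist, angle, speed) in means.items():
    print(f"cluster {cluster}: {dist:.1f} ft, {angle:.1f} deg, {speed:.1f} mph")
```

Comparing the rows of this table side by side is usually enough to give each cluster a descriptive name.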
Note that our algorithm really focused in on distance. Cluster 2 has the shortest home runs with the highest launch angles, cluster 1 has the farthest home runs with the highest launch speeds, and cluster 3 is middle of the pack. We’ll call cluster 1 lasers, cluster 2 moonshots, and cluster 3 typicals.
Now that that’s settled, who hits the most of each kind? They all count the same, so it’s more trivia than anything, but it’s fun to know. We’ll filter to players with at least 15 home runs.
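The filter-then-count step can be sketched in a few lines of Python. The players and counts below are placeholders rather than real results, `leaders_by_type` is a hypothetical helper, and the toy call lowers the threshold to 4 only because the sample is tiny.

```python
from collections import Counter

def leaders_by_type(home_runs, min_hr=15):
    """home_runs: iterable of (player, hr_type) pairs.

    Returns {hr_type: [(player, count), ...]} sorted by count, highest first,
    keeping only players with at least min_hr total home runs.
    """
    home_runs = list(home_runs)
    totals = Counter(player for player, _ in home_runs)
    qualified = {player for player, n in totals.items() if n >= min_hr}
    counts = Counter((p, t) for p, t in home_runs if p in qualified)
    result = {}
    for (player, hr_type), n in counts.items():
        result.setdefault(hr_type, []).append((player, n))
    for hr_type in result:
        # Sort by count descending, then by name to break ties.
        result[hr_type].sort(key=lambda pair: (-pair[1], pair[0]))
    return result

# Placeholder data: one (player, type) pair per home run.
toy = (
    [("Player A", "moonshot")] * 3 + [("Player A", "laser")] * 1
    + [("Player B", "laser")] * 3 + [("Player B", "typical")] * 2
    + [("Player C", "moonshot")] * 2  # only 2 total, so filtered out below
)

leaders = leaders_by_type(toy, min_hr=4)
for hr_type, ranking in sorted(leaders.items()):
    print(hr_type, ranking)
```

Filtering on total home runs before counting keeps small-sample players from cluttering the leaderboards.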
Cal Raleigh not only leads baseball in home runs, he’s also dominating the moonshots category with 15, almost twice as many as runners-up Pete Crow-Armstrong and Isaac Paredes (8 each). Eugenio Suárez has a league-leading 15 lasers, while Shohei Ohtani leads in typicals with 12 - which instantly makes me suspect my naming convention, since nothing Ohtani does is typical. And if we look at his other categories, we see that he tends to lift the ball more on his home runs, eschewing the lower launch angles of his merely mortal peers.
So there’s clustering - a quick, relatively painless way to get you started categorizing your data.