I ran a 5K this past weekend, finishing in 24:17.43 with a 7:49 pace. It was a flat course with 471 runners participating. My time was considerably faster than I anticipated given my recent history, so I got to thinking about how I could have calibrated my expectations better.
For starters, I could have listened to my Garmin, but it’s been telling me I can run sub-21-minute 5Ks for a while now, even though my all-time best is 22 minutes 22 seconds. So that seems helpful but not sufficient.
While that piece still needs work, it led me down the more interesting rabbit trail of seeing how my performance compared to what a model would predict. I downloaded the results to play around with them. First, let’s look at the distribution.
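I won’t reproduce the original plotting code exactly, but here’s a minimal sketch of one way to get that distribution view, assuming the results sit in a data frame called race with chip times in seconds in a chip_time_dec column (mirroring the column names used in the model later on):

library(tidyverse)
library(hrbrthemes)  # provides theme_ipsum(), used throughout this post

# Density of finish times with the mean marked; the column name and units
# are assumptions based on the model code further down.
race %>%
  ggplot(aes(x = chip_time_dec)) +
  geom_density(fill = '#A8DADC', alpha = 0.7) +
  geom_vline(xintercept = mean(race$chip_time_dec, na.rm = TRUE),
             linetype = 'dashed') +
  theme_ipsum()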
Our average chip time is 1765.01 seconds, or 29.42 minutes. This seems about right; however, the peak of the distribution sits below the mean, which is being pulled upward by a number of runners with long race times. We also know that there’s a difference between male and female runners, so let’s account for that.
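Again a sketch rather than the original code: one way to split that view by gender, using the same assumed columns and the same palette as the plots below, with gender coded 'F'/'M':

race %>%
  ggplot(aes(x = chip_time_dec, color = gender, fill = gender)) +
  geom_density(alpha = 0.5) +
  theme_ipsum() +
  scale_color_manual(values = c('#F8C8DC', '#A8DADC')) +
  scale_fill_manual(values = c('#F8C8DC', '#A8DADC'))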
We see a similar story for men and women: both have means dragged higher by slower runners. Maybe older runners are causing the drag. Let’s add age to our view to find out.
race %>%
  ggplot(aes(x = age, y = chip_time, color = gender, fill = gender)) +
  geom_point() +
  geom_smooth() +
  theme_ipsum() +
  scale_color_manual(values = c('#F8C8DC', '#A8DADC')) +
  scale_fill_manual(values = c('#F8C8DC', '#A8DADC'))
The lines are similar for both genders, though runners around age 20 tend to be the fastest. I saw several college track and field types, so this makes sense. Age is a factor, but the over-60 crowd is too small to draw any conclusions from.
We’ve gotten about as far as we can exploring the data. Since we’re missing info on training, general health, past performance, supershoes, etc. that the race website doesn’t provide (thankfully - that’d be a privacy nightmare), we’re somewhat limited. Still, just for kicks, we can run a regression model to see how important age is.
Our model contains only one observation per runner, so we don’t have to worry about fixed effects for repeated measures. Each runner is independent, so one time shouldn’t have an impact on another. Finally, they’re all drawn from the same process, in this case a single race, so they’re all comparable. We’ll use the interaction of gender and age as our terms, and we’ll use sjPlot to present our results.
lm_run <- lm(chip_time_dec ~ gender * age, data = race)

plot_model(lm_run, type = 'pred', terms = c('age', 'gender')) +
  theme_ipsum() +
  scale_color_manual(values = c('#A8DADC', '#F8C8DC')) +
  scale_fill_manual(values = c('#A8DADC', '#F8C8DC'))
This is a bit more readable than our scatterplot, but our goal isn’t readability, it’s understanding our coefficients and seeing just how much we can trust the model. We’ll use tab_model from sjPlot to get this.
lm_run %>% tab_model()
Dependent variable: chip_time_dec (seconds)

Predictors          Estimates    CI                    p
(Intercept)         1780.69      1596.42 – 1964.96     <0.001
gender [M]          -420.33      -657.34 – -183.33     0.001
age                 4.61         0.30 – 8.93           0.036
gender [M] × age    2.10         -3.45 – 7.64          0.458
Observations        470
R2 / R2 adjusted    0.125 / 0.119
So our baseline is roughly 29.7 minutes (the 1780.69-second intercept). This is pretty close to our overall average, though not identical, since the intercept is the model’s prediction for a female runner at age zero rather than the sample mean. Men are predicted to run about 7 minutes faster on average, though this ranges from roughly 3 to 11 minutes, which is pretty substantial. Every year older* adds 4.6 seconds to race time (between 0.3 and 8.9 seconds), which for a 5K is not massive unless you’re trying to place. That 4.6 seconds applies to both genders; the interaction term suggests that being a man adds roughly 2 more seconds per year on top of it. However, that estimate can range from -3.5 to 7.6 seconds, and since the interval includes zero, we don’t consider it significant.
*There’s some selection bias here, as older runners tend to have been running longer and have more experience. This shouldn’t be interpreted as being applicable to someone running their first race.
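To make the coefficient arithmetic concrete, here’s a quick sketch of how the model assembles a prediction for a made-up 35-year-old male runner (the age is purely illustrative, not mine):

# Rebuilding a prediction by hand from the coefficients above.
intercept <- 1780.69   # baseline: female, age zero, in seconds
male      <- -420.33   # shift for gender [M]
per_year  <- 4.61      # seconds added per year of age
male_year <- 2.10      # extra seconds per year for men (not significant)

age <- 35              # made-up example age
pred_sec <- intercept + male + per_year * age + male_year * age
pred_sec / 60          # about 26.6 minutes

# The fitted model gives the same number directly, assuming gender is coded 'F'/'M':
# predict(lm_run, newdata = data.frame(age = 35, gender = 'M'))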
The model has an R2 of 0.125, which is pretty weak. We kind of expected this; there’s a lot that goes into running performance that we’re not taking into account here. So while the model indicates I should expect to be about 5 seconds slower next year, that’s too weak a prediction to be super confident in. We could possibly get better results with XGBoost, bootstrapping our sample, etc.*, but for our purposes that’s overkill.
*The observant reader will notice I didn’t do a test/train split. My primary goal was understanding what affects time, but generally we should hold out some data so we can test out of sample.
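For what it’s worth, here’s a minimal sketch of what that hold-out could look like on this data; the 80/20 split and the seed are arbitrary choices for illustration, not part of the original analysis:

# Simple train/test split on the race data frame used above.
set.seed(42)
train_idx <- sample(nrow(race), size = floor(0.8 * nrow(race)))
train <- race[train_idx, ]
test  <- race[-train_idx, ]

lm_train <- lm(chip_time_dec ~ gender * age, data = train)
preds    <- predict(lm_train, newdata = test)
sqrt(mean((test$chip_time_dec - preds)^2, na.rm = TRUE))  # out-of-sample RMSE, in seconds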
The missing-variable concept is the biggest thing for an aspiring analyst to take away from this. Simply having a dataset doesn’t mean it captures everything relevant. That doesn’t always mean there’s a magic variable that will get your R2 to 0.8, but it does mean that thinking through the model’s assumptions and the factors that actually drive the outcome are key steps in the modeling process.
As for what the model thought of me, it expected somewhere between 26:33 and 28:22, so I beat its expectations. If the model were any good, that would be satisfying; as it is, I’m perfectly happy simply exceeding my own goal.