Groundhogs vs. Coins

statistics, r

Author: Mark Jurries II
Published: February 2, 2026

It’s Groundhog Day today, and since I’m in this endless time loop I got to thinking about the probability of a coin beating Punxsutawney Phil. A quick Google search suggested Phil’s accuracy was somewhere between 30% and 40%, depending on who you asked. Critically, sites such as the Punxsutawney Groundhog Club and Wikipedia showed what he predicted, but not whether he was right.

I finally found some data in this news article, which Gemini quickly turned into a tibble so I could analyze it in R.

Show the code
library(hrbrthemes)
library(janitor)
library(gt)
library(gtExtras)
library(tidyverse)

groundhog_data <- tibble(
  year = c(2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025),
  shadow = c("No shadow", "Shadow", "Shadow", "No shadow", "Shadow", "Shadow", "No shadow", 
             "No shadow", "Shadow", "Shadow", "Shadow", "No shadow", "Shadow"),
  prediction = c("Right", "Right", "Wrong", "Right", "Wrong", "Wrong", "Wrong", 
                 "Right", "Wrong", "Wrong", "Wrong", "Right", "Wrong"),
  type = c("Early spring", "More winter", "More winter", "Early spring", "More winter", 
           "More winter", "Early spring", "Early spring", "More winter", "More winter", 
           "More winter", "Early spring", "More winter")
) %>%
  mutate(phil_right = ifelse(prediction == "Right", 1, 0),
         actual = case_when(phil_right == 1 ~ type,
                            type == 'Early spring' ~ 'More winter',
                            TRUE ~ 'Early spring'))

groundhog_results <- groundhog_data %>%
  mutate(
    actual = case_when(
      prediction == "Right" ~ type,
      prediction == "Wrong" & type == "More winter" ~ "Early spring",
      prediction == "Wrong" & type == "Early spring" ~ "More winter"
    ),
    # Convert to factors for calculation
    type = factor(type, levels = c("More winter", "Early spring")),
    actual = factor(actual, levels = c("More winter", "Early spring"))
  )

accuracy_val <- mean(groundhog_results$prediction == "Right")


groundhog_data %>%
  summarise(n = n(),
            accuracy = mean(phil_right)) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(columns = c('accuracy'), decimals = 1)
n    accuracy
13      38.5%

The bad news is that we only have 13 years of data. The good news is that the 38.5% matches the other numbers I’d found, so it’s a fairly representative sample. Let’s check how he does based on whether he sees his shadow.

Show the code
groundhog_data %>%
  summarise(n = n(),
            precision = mean(phil_right),
            .by = shadow) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(columns = c('precision'), decimals = 1)
shadow       n   precision
No shadow    5       80.0%
Shadow       8       12.5%

This is quite the split. When he “predicts” an early spring (no shadow), he’s right 80% of the time. But if he calls for more winter (shadow), he’s only right 12.5% of the time. Note that we’re labeling these precision: in machine learning, precision measures how often predictions of a specific outcome are right, as opposed to accuracy, which measures how often predictions are right overall. This split’s already telling us something.
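Since precision here is just “of the years Phil forecast a given outcome, how often did that outcome actually happen,” we can reproduce the table straight from the forecast (type) and the derived actual columns. A minimal sketch, equivalent to grouping by shadow, since “Shadow” always maps to “More winter” and “No shadow” to “Early spring”:

# Precision per forecast class: among the years Phil predicted this class,
# how often did the actual outcome match?
groundhog_data %>%
  summarise(n = n(),
            precision = mean(actual == type),
            .by = type)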

On to the coin. We’ll flip a coin for each year in our dataset: if it’s heads, we’ll call for an early spring, and if it’s tails, more winter*, so all 13 years get a coin prediction. We’ll then do it a second time, then a third, and so on, until we have 10,000 samples. Thankfully we can do this in a few lines of code, as opposed to making Phil Connors do it in his rather ample downtime.

*Technically, we’re assigning a bunch of 1s and 0s, and since the odds are 50%, it doesn’t really matter which gets what. So if it really matters to you that tails means early spring, go ahead and pretend I picked that.
Show the code
set.seed(9871) # set the seed before any random draws so the results are reproducible

bootstrapped_sim <- map_dfr(1:10000, ~ {
  groundhog_data %>%
    # Sample with replacement to keep the same size (n=13)
    slice_sample(prop = 1, replace = TRUE) %>%
    mutate(
      id = .x,
      coin = rbinom(n(), 1, .5),
      coin_guess = ifelse(coin == 1, 'Early spring', 'More winter'),
      # We check the coin against the 'type' in that specific bootstrap sample
      coin_right = ifelse(tolower(coin_guess) == tolower(type), 1, 0)
    )
})

bootstrapped_sim %>%
  mutate(both_right = ifelse(coin_right == 1 & phil_right == 1, 1, 0),
         both_wrong = ifelse(coin_right == 0 & phil_right == 0, 1, 0),
         phil_only_right = ifelse(coin_right == 0 & phil_right == 1, 1, 0),
         coin_only_right = ifelse(coin_right == 1 & phil_right == 0, 1, 0)
  ) %>%
  summarise(phil_right = mean(phil_right),
            coin_right = mean(coin_right),
            both_right = mean(both_right),
            both_wrong = mean(both_wrong),
            phil_only_right = mean(phil_only_right),
            coin_only_right = mean(coin_only_right)) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(decimals = 1)
phil_right   coin_right   both_right   both_wrong   phil_only_right   coin_only_right
     38.5%        49.7%        19.2%        31.0%             19.3%             30.5%

We expected the coin to be right more often, and sure enough, the coin was in fact right about half the time (49.7%). They can both be right (or wrong) in the same year, but generally, you’re better off trusting the coin. What about by our spring/winter split?
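Since the coin flips are independent of Phil, the joint cells in that table should land near the products of the marginals. A quick back-of-the-envelope check (the numbers below are just the table’s marginal values re-entered by hand):

phil <- 0.385 # Phil's accuracy
coin <- 0.497 # the coin's accuracy

# Under independence, each joint cell is the product of the marginals.
tibble(
  both_right      = phil * coin,              # ~19.1%; table shows 19.2%
  both_wrong      = (1 - phil) * (1 - coin),  # ~30.9%; table shows 31.0%
  phil_only_right = phil * (1 - coin),        # ~19.4%; table shows 19.3%
  coin_only_right = (1 - phil) * coin         # ~30.6%; table shows 30.5%
)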

Show the code
bootstrapped_sim %>%
  mutate(both_right = ifelse(coin_right == 1 & phil_right == 1, 1, 0),
         both_wrong = ifelse(coin_right == 0 & phil_right == 0, 1, 0),
         phil_only_right = ifelse(coin_right == 0 & phil_right == 1, 1, 0),
         coin_only_right = ifelse(coin_right == 1 & phil_right == 0, 1, 0)
  ) %>%
  summarise(phil_right = mean(phil_right),
            coin_right = mean(coin_right),
            both_right = mean(both_right),
            both_wrong = mean(both_wrong),
            phil_only_right = mean(phil_only_right),
            coin_only_right = mean(coin_only_right),
            .by = type) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(decimals = 1)
type           phil_right   coin_right   both_right   both_wrong   phil_only_right   coin_only_right
More winter         12.4%        49.7%         6.2%        44.1%              6.2%             43.5%
Early spring        79.9%        49.7%        39.8%        10.2%             40.1%              9.9%

So maybe you take Phil’s word for it when he calls for an early spring. But recall we only have 5 such years, which is hardly a robust sample size. There are techniques we could apply to try to work around this, but none worth the time just to pin down the accuracy of a rodent’s weather predictions.
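To see just how shaky 5 observations are, one such technique, an exact binomial confidence interval, puts Phil’s 4-for-5 early-spring record almost anywhere (this aside is mine, not part of the original analysis):

# Exact (Clopper-Pearson) 95% CI for 4 correct calls out of 5 tries
binom.test(x = 4, n = 5)$conf.int
# roughly 0.28 to 0.99: consistent with anything from a coin to near-perfection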

Finally, it’s helpful to check accuracy for all 10,000 samples and compare to Phil’s actual performance. Our coin does a better job 71% of the time, so over the long run the evidence suggests it’s the better model. Typically when approaching a machine learning problem we’d use the coin model as a baseline to improve from, something they should perhaps keep in mind when building a cyborg groundhog.
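That 71% also falls out of a closed-form check: each flip is right with probability 1/2 no matter what the weather does, so the coin’s number of correct calls over 13 years is Binomial(13, 0.5), and it beats Phil’s 5-of-13 whenever it lands 6 or more. A one-liner confirms it:

# P(a fair coin gets at least 6 of 13 calls right), i.e. beats Phil's 5/13
1 - pbinom(5, size = 13, prob = 0.5)
# 0.7094727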

Show the code
bootstrapped_sim %>%
  # accuracy of each predictor within each of the 10,000 samples
  summarise(phil_right = mean(phil_right),
            coin_right = mean(coin_right),
            .by = id) %>%
  ggplot(aes(x = coin_right))+
  geom_histogram()+
  # Phil's actual accuracy (38.5%) for reference
  geom_vline(aes(xintercept = accuracy_val))+
  theme_ipsum()+
  theme(text=element_text(size = 16,  family="Oswald"),
        panel.grid.minor = element_blank(),
        plot.title.position = "plot")+
  scale_x_continuous(labels = scales::percent)+
  ylab('Count')+
  xlab('Coin Accuracy')