It’s Groundhog Day today, and since I’m in this endless time loop I got to thinking about the odds of a coin beating Punxsutawney Phil. A quick Google search suggested Phil is right about 30% to 40% of the time, depending on who you ask. Critically, sites such as the Punxsutawney Groundhog Club and Wikipedia showed what he predicted, but not whether he was right.
I finally found some data in this news article, which Gemini quickly turned into a tibble so I could analyze it in R.
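In case it helps to picture the data, here’s a minimal sketch of the structure the rest of the code assumes. The column names prediction and type are my guesses, not necessarily what the article or Gemini produced; the point is just one row per year, with Phil’s call, the actual outcome, and a phil_right flag derived from the two.

library(tidyverse)

# Assumed shape of groundhog_data (column names are my guess):
#   year       - year of the forecast
#   prediction - Phil's call: "Early spring" or "More winter"
#   type       - what the weather actually did: "Early spring" or "More winter"
groundhog_data <- groundhog_data %>%
  mutate(phil_right = ifelse(tolower(prediction) == tolower(type), 1, 0))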
The bad news is that this only gives us 13 years. The good news is that Phil’s hit rate over that span matches the other numbers I’d found, so it’s a fairly representative sample. Let’s check how he does based on seeing his shadow.
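Here’s a rough sketch of that check, under the column assumptions above: group on Phil’s call and take the mean of phil_right within each group.

groundhog_data %>%
  group_by(prediction) %>%                 # "Early spring" vs. "More winter" calls
  summarise(
    n = n(),                               # how many times he made that call
    precision = mean(phil_right),          # share of those calls that were correct
    .groups = "drop"
  )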
This is quite the split. When he “predicts” an early spring, he’s right 80% of the time. But if he calls for more winter, he’s only right 12.5% of the time. Note we’re labeling these precision: in machine learning, precision is how often predictions of a particular outcome are correct, as opposed to accuracy, which is how often predictions are right overall. This split’s already telling us something.
On to the coin. We’ll flip a coin: if it’s heads, we’ll call for an early spring; if it’s tails, more winter.* We’ll do that for each year in our dataset, so all 13 years get a coin prediction. We’ll then do it a second time, then a third, and so on, until we have 10,000 samples. Thankfully we can do this in a few lines of code, as opposed to making Phil Connors do it in his rather ample downtime.
*Technically, we’re just assigning the outcomes to a bunch of 1s and 0s, and since the odds are 50/50, it doesn’t really matter which gets which. So if it really matters to you that tails means early spring, go ahead and pretend I picked that.
library(gt)        # gt() and fmt_percent() for the summary table
library(gtExtras)  # gt_theme_espn()

bootstrapped_sim <- map_dfr(1:10000, ~ {
  groundhog_data %>%
    # Sample with replacement to keep the same size (n = 13)
    slice_sample(prop = 1, replace = TRUE) %>%
    mutate(
      id = .x,
      coin = rbinom(n(), 1, .5),
      coin_guess = ifelse(coin == 1, 'Early spring', 'More winter'),
      # We check the coin against the 'type' in that specific bootstrap sample
      coin_right = ifelse(tolower(coin_guess) == tolower(type), 1, 0)
    )
})

bootstrapped_sim %>%
  mutate(
    both_right = ifelse(coin_right == 1 & phil_right == 1, 1, 0),
    both_wrong = ifelse(coin_right == 0 & phil_right == 0, 1, 0),
    phil_only_right = ifelse(coin_right == 0 & phil_right == 1, 1, 0),
    coin_only_right = ifelse(coin_right == 1 & phil_right == 0, 1, 0)
  ) %>%
  summarise(
    phil_right = mean(phil_right),
    coin_right = mean(coin_right),
    both_right = mean(both_right),
    both_wrong = mean(both_wrong),
    phil_only_right = mean(phil_only_right),
    coin_only_right = mean(coin_only_right)
  ) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(decimals = 1)
| phil_right | coin_right | both_right | both_wrong | phil_only_right | coin_only_right |
|------------|------------|------------|------------|-----------------|-----------------|
| 38.5%      | 49.7%      | 19.2%      | 31.0%      | 19.3%           | 30.5%           |
We expected the coin to be right more often, and sure enough: across the simulations the coin was right about half the time, versus Phil’s 38.5%. They can both be right (or wrong) in the same year, but generally you’re better off trusting the coin. What about by our spring/winter split?
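One way to slice it (a sketch, again using the assumed prediction column from earlier): pool all the bootstrap rows and compute each forecaster’s precision within its own calls.

# Phil's precision by his call, pooled across all bootstrap rows
bootstrapped_sim %>%
  group_by(prediction) %>%
  summarise(phil_precision = mean(phil_right), .groups = "drop")

# The coin's precision by its call
bootstrapped_sim %>%
  group_by(coin_guess) %>%
  summarise(coin_precision = mean(coin_right), .groups = "drop")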
So maybe you take Phil’s word for it when he calls for an early spring. But recall we only have 5 early-spring calls from him, which is hardly a robust sample size. There are some techniques we could apply to try to work around this, but none worth the effort just to pin down the accuracy of a rodent’s weather predictions.
Finally, it’s helpful to check accuracy across all 10,000 bootstrap samples and compare it to Phil’s performance. Our coin does a better job 71% of the time, so over the long run the evidence suggests it’s the better model. Typically, when approaching a machine learning problem, we’d use the coin model as a baseline to improve from, something they should perhaps keep in mind when building a cyborg groundhog.
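A sketch of that comparison: compute the coin’s accuracy within each bootstrap sample, compare it to Phil’s accuracy on the same resample, and count how often the coin comes out ahead (the tie-handling here is my choice; ties count against the coin).

bootstrapped_sim %>%
  group_by(id) %>%
  summarise(
    coin_acc = mean(coin_right),   # coin's accuracy in this resample
    phil_acc = mean(phil_right),   # Phil's accuracy in this resample
    .groups = "drop"
  ) %>%
  summarise(coin_beats_phil = mean(coin_acc > phil_acc))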