I was at a raffle drawing this past weekend with pretty low expectations. Typically, when I enter these things, I go home empty handed. Still, it was all for a good cause, so I was happy to buy some tickets. So I was surprised when one of my tickets was called out, equally shocked when another was called a minute later, and nonplussed when a third ticket won after that.
Naturally, since I’m the data guy, I was asked what the odds were. That question’s been rolling around my brain more than I care to admit*, so I decided to tackle it.
*I did just put this on the internet, so perhaps don’t take this literally.
The first step in something like this is defining the problem. The raffle was set up across multiple categories, each with a certain number of prizes. I’m being general here because I don’t recall the exact number of categories nor number of prizes. Nor do I know how many tickets total were entered. So that makes it difficult right up front.
We can estimate, however. There were about 120 people present, and we’ll assume they bought an average of 5 tickets each. Their welcome packets contained one free ticket *, so that’s six tickets on average. That gives us 720 tickets total.
*A fact I didn’t realize until I got home, but we’ll assume everyone else was more observant.
However; I only entered in two categories, so I wasn’t up against that many. I believe there were seven categories total, so we’ll go with that. One of them was a grill, which we can conclude attracted more tickets. We’ll say the grill got 150 ticket, leave 570 left over. That conveniently leaves us with 570/6 = 95, so we have a nice round number to work with.
Since I entered five tickets each in two categories, my odds of winning in a single categories were 5/95, or about 5.2%. All we need now are the total number of prizes in each, which we’ll estimate at 7 and 8.
From here, we could just use the dbinom function in R to calculate the odds. If you want to do this in Excel, the BINOM.DIST function is your friend. Since these are independent odds, we’ll multiply them together - this will give us the odds of winning two items from set A and 1 from set B.
OK, 1%. That’s unlikely, but not impossible. It’s also dependent on our assumptions of tickets sold and prizes offered. We’ll use our current assumptions in the next section, but we can take a quick break to see how the odds change for various assumptions. We’ll only look at one raffle and check the odds of winning exactly two prizes.
*I’d already written most of this code when I saw this great video explaning Monte Carlo, which I watched over dinner as cool people are wont to do. Follow the channel, it’s highly educational and entertaining.
Show the code
prize_count <-seq(from =5, to =10) %>%as_tibble() %>%rename(prize_count = value)tickets <-seq(from =10, to =100) %>%as_tibble() %>%rename(total_tickets_entered = value)raffle_permutations <- prize_count %>%crossing(tickets) %>%mutate(tickets_entered =5,odds = tickets_entered / total_tickets_entered) %>%rowwise() %>%mutate(binom_odds =dbinom(2, prize_count, odds),p_binom_odd =pbinom(2, prize_count, odds))raffle_permutations %>%ggplot(aes(x = total_tickets_entered, y = binom_odds, color =as.factor(prize_count)))+geom_line()+theme_ipsum()+ggtitle('Odds of Winning Exactly Two Prizes')+labs(color ="Prizes in Raffle")+scale_color_ipsum()
There’s some funkiness with the lower total tickets entered, which is counter-intuitively expected - we’re looking for exactly two, and there’s a lot of random chance with small sample sizes. Our best bet is to hope for about 20 tickets entered, regardless of prizes available. My assumption is much higher, though, so for better or worse we’ll stick with that.
Next, we’ll move on to simulating the data, sometimes called Monte Carlo* simulation or bootstrapping. The advantage here is that we can investigate individual sims and see what the details are. We’ll write a function that accepts tickets entered, total tickets, prizes, and number of sims as arguments. It will generate a set of tickets, then draw for the total number of prizes. Since this is a fairly basic simulation the compute power needed is pretty low - running 50K takes less than a second on my laptop.
Let’s run this, then take a look at an example draw from group A.
In this sim, we won in round 1 and lost the other draws. We could look at the remaining 49,999 individually, or we could total them up and look at that. Let’s save ourselves a few hours and do that. We’ll look first at the odds of winning n prizes first.
Show the code
both_groups_agg %>%count(total_won) %>%mutate(p = n /sum(n),cumulative_p =cumsum(p)) %>%gt() %>%gt_theme_espn() %>%fmt_percent(columns =c(p, cumulative_p), decimals =1) %>%fmt_number(columns = n, decimals =0)
total_won
n
p
cumulative_p
0
21,378
42.8%
42.8%
1
19,505
39.0%
81.8%
2
7,336
14.7%
96.4%
3
1,564
3.1%
99.6%
4
200
0.4%
100.0%
5
15
0.0%
100.0%
6
2
0.0%
100.0%
There’s a 43% chance of winning zero prizes, which is the highest individual outcome, but it also means I had a better than 50% chance of winning going in, though only a 3% chance of winning 3 times.
That’s higher than our earlier numbers - that’s because we were focused on winning 2 from A and 1 from B. Let’s check the odds of each permutation.
There were other ways to get to 3 - win 0 and 3, 3 and 0, or 1 and 2. Also note that our 2 and 1 combo has odds of 1.3%, approximating our dbinom function from earlier. While I’d still use dbinom for a more precise number, having our data simulated like this helps us see the broader range of possibilities. But even that has its limits - let’s plot this instead, showing prizes won in group A on the x axis and group B on the y axis.
Show the code
agg_counts <- both_groups_agg %>%count(group_a_won, group_b_won, total_won) %>%mutate(p = n /sum(n)) agg_counts %>%ggplot(aes(x = group_a_won, y = group_b_won, fill = p))+geom_tile()+geom_text(aes(label = scales::label_percent(accuracy =0.01)(p)))+theme_ipsum()+theme(legend.position ="none",panel.grid.major =element_blank(),panel.grid.minor =element_blank())
There are no records in this simulation where 3 prizes were won in group A and more than 3 won in group B. Indeed, group A tops out at 4, while B maxes at 4. We can reasonably expect that the odds of winning 4 in each are infinitesimally small, but if we wanted to know that number we’d need dbinom instead.
It’s also worth noting that Monte Carlo, at least as executed here, isn’t replicatable. You’ll get close, but it’ll change a bit each time, even if you remember to set the seed. It’s not a huge deal in most cases, but something you’ll want to keep in mind.
Which technique you use will depend on your needs. If you need to know every possible combination, you’re better off creating a pair of tibbles with the info you need, crossing them, then reporting the values. The code for doing so earlier is simple and quick to run.
On the other hand, iIf you’re willing to accept a loss in precision for a gain in explainability (it’s far easier to say you simulated 50,000 raffle draws than it is to explain binomial probability), then Monte Carlo simulation is the way to go. Regardless, they’re both valuable tools to have on your belt.