Terms of Regression

statistics
r
probability
Author

Mark Jurries II

Published

September 22, 2025

Linear regression is one of the single most powerful tools any analyst can wield. That power comes with a whole lot of assumptions you’ll need to be careful of, not only in how the data is distributed but in how you set up your model.

To illustrate this, we’ll create a dummy dataset. It doesn’t really matter what it represents, but it’s nice to keep things concrete, so we’ll say it’s monthly spend for an online retailer. We’ll create a segment with three classes of low, medium, and high, which add -15, 0, and 15 respectively to our value. We’ll also create three channels, “email”, “social”, and “search”, which add -5, 10, and 5 respectively, in addition to the “no channel” option, which adds 0.

We’ll generate values of around $100, with a standard deviation of $10 just to keep things interesting. Let’s take a quick look at our results.

Show the code
library(emmeans)
library(hrbrthemes)
library(gt)
library(gtExtras)
library(tidyverse)

set.seed(2025)

groups <- tribble(~group, ~group_effect, ~group_weight,
                'Low', -15, 0.6,
                'Med', 0, 0.3,
                'High', 15, 0.1)

channels <- tribble(~channel, ~channel_effect, ~channel_weight,
                    'No Channel', 0, 0.5,
                    'Email', -5, 0.2,
                    'Social', 10, 0.2,
                    'Search', 5, 0.1)

group_channels <- groups %>%
  cross_join(channels) %>%
  mutate(weight = group_weight * channel_weight) 

group_channels_booted <- slice_sample(group_channels, 
                                      n = 10000, 
                                      replace = TRUE,
                                      weight_by = weight) %>%
  mutate(value = rnorm(n(), 100, 10)) %>%
  mutate(value = value + group_effect + channel_effect)

group_channels_booted %>%
  group_by(group, channel) %>%
  summarise(value = mean(value)) %>%
  pivot_wider(names_from = channel, values_from = value) %>%
  ungroup() %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_currency(columns = 2:5, decimals = 0)
group  Email  No Channel  Search  Social
High   $110   $115        $120    $126
Low    $80    $85         $90     $95
Med    $95    $100        $107    $110

Our “no channel” column is exactly what we’d expect given how we’ve set up the groups. The other channels are more difficult to read, though, so we can’t tell from them what we’d expect for the group. Let’s run a linear regression model on value using group as our term.

Show the code
lm_group <- lm(value ~ group, data = group_channels_booted)
summary(lm_group)

Call:
lm(formula = value ~ group, data = group_channels_booted)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.067  -7.728   0.040   7.486  42.954 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 116.3217     0.3565  326.28   <2e-16 ***
groupLow    -29.7388     0.3845  -77.34   <2e-16 ***
groupMed    -14.7191     0.4136  -35.59   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.26 on 9997 degrees of freedom
Multiple R-squared:  0.441, Adjusted R-squared:  0.4409 
F-statistic:  3943 on 2 and 9997 DF,  p-value: < 2.2e-16

There are a few things we note right away about the summary. Our “high” group isn’t listed - it’s first alphabetically, so R uses it as the reference level automatically. The intercept is therefore the estimate for the high group, the estimate of -$29.74 for “Low” is relative to the high group, and the estimate of -$14.72 for the medium group is likewise relative to the high group.

*We could relevel the factors if we wanted to. This can be appropriate depending on your use case, but it’s not always necessary.
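If we did want to relevel, a minimal sketch might look like this (fct_relevel comes from forcats, which tidyverse already loads) - the coefficients would then read as lifts relative to the low group rather than drops from the high group:

```r
# Make "Low" the reference level before fitting;
# fct_relevel() moves the named level to the front of the factor.
lm_group_releveled <- group_channels_booted %>%
  mutate(group = fct_relevel(group, 'Low')) %>%
  lm(value ~ group, data = .)
```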

This is fine if the intent is to build a predictive model. This would not be a great model, mind you, but in that case we’d be more interested in whether our terms were significant and in the adjusted r-squared. But since we want to understand our data, we can use the emmeans package for display instead.

Show the code
emmeans(lm_group, ~ group)
 group emmean    SE   df lower.CL upper.CL
 High  116.32 0.357 9997    115.6   117.02
 Low    86.58 0.144 9997     86.3    86.87
 Med   101.60 0.210 9997    101.2   102.01

Confidence level used: 0.95 
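These estimated marginal means line up with the coefficients from the summary - each one is just the intercept plus that group’s coefficient:

```r
# Reconstruct the emmeans by hand from the model coefficients.
coefs <- coef(lm_group)
coefs[['(Intercept)']]                        # High: ~116.32
coefs[['(Intercept)']] + coefs[['groupLow']]  # Low:  ~86.58
coefs[['(Intercept)']] + coefs[['groupMed']]  # Med:  ~101.60
```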

This is better - our groups are showing roughly the values we’d expected. But, of course, we know that channel also plays a role, at least in our fictionalized data. What if we add that to the model?

Show the code
lm_both <- lm(value ~ group + channel, data = group_channels_booted)
summary(lm_both)

Call:
lm(formula = value ~ group + channel, data = group_channels_booted)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.014  -6.815   0.032   6.709  38.819 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       110.0050     0.3707  296.73   <2e-16 ***
groupLow          -30.0387     0.3414  -87.99   <2e-16 ***
groupMed          -15.0364     0.3672  -40.95   <2e-16 ***
channelNo Channel   5.0223     0.2629   19.10   <2e-16 ***
channelSearch      10.4527     0.3843   27.20   <2e-16 ***
channelSocial      15.4294     0.3163   48.78   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.998 on 9994 degrees of freedom
Multiple R-squared:  0.5596,    Adjusted R-squared:  0.5594 
F-statistic:  2540 on 5 and 9994 DF,  p-value: < 2.2e-16

First, our model’s adjusted r-squared has jumped from 0.44 to 0.56, so we’re explaining more of the data this way. The baseline is now the high group with email, and since email carries a penalty in this dataset the intercept has gone down. Since we’re looking at two dimensions now, we can also see which levels have the largest impact - so while the default summary has its limits, it does help us find opportunities for where to focus our time.
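If you want that comparison side by side, it’s easy to pull straight from the model objects:

```r
# Adjusted r-squared for each model, straight from summary().
summary(lm_group)$adj.r.squared  # ~0.44
summary(lm_both)$adj.r.squared   # ~0.56
```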

Now that we’ve said summary can be useful, let’s go ahead and use emmeans instead anyway.

Show the code
emmeans(lm_both, ~ group)
 group emmean    SE   df lower.CL upper.CL
 High  117.73 0.322 9994   117.10   118.36
 Low    87.69 0.142 9994    87.41    87.97
 Med   102.69 0.196 9994   102.31   103.08

Results are averaged over the levels of: channel 
Confidence level used: 0.95 

Slightly different - the estimates drift a touch, but the standard errors are smaller. Let’s add channel to the view.
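As an aside, the drift comes from emmeans averaging over the channel levels with equal weight by default. If you’d rather average using the observed channel frequencies, it takes a weights argument:

```r
# Weight by observed channel frequencies instead of equally;
# this pulls the group estimates back toward the raw means.
emmeans(lm_both, ~ group, weights = 'proportional')
```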

Show the code
emmeans(lm_both, ~ group + channel)
 group channel    emmean    SE   df lower.CL upper.CL
 High  Email      110.00 0.371 9994   109.28   110.73
 Low   Email       79.97 0.236 9994    79.50    80.43
 Med   Email       94.97 0.272 9994    94.43    95.50
 High  No Channel 115.03 0.332 9994   114.38   115.68
 Low   No Channel  84.99 0.162 9994    84.67    85.31
 Med   No Channel  99.99 0.211 9994    99.58   100.40
 High  Search     120.46 0.434 9994   119.61   121.31
 Low   Search      90.42 0.324 9994    89.78    91.05
 Med   Search     105.42 0.351 9994   104.73   106.11
 High  Social     125.43 0.378 9994   124.69   126.18
 Low   Social      95.40 0.239 9994    94.93    95.86
 Med   Social     110.40 0.275 9994   109.86   110.94

Confidence level used: 0.95 

This matches our cross-tab from earlier almost perfectly. Which raises the question of why we did all this in the first place*. Well, in this instance, we knew the answers since we created the data in the first place. In real-world situations, though, you’re often given datasets without a cheat sheet, and the model has to do the heavy lifting. When this happens, having access to models will help you work out how different dimensions influence your data.

*The real answer, though, is “content”.

It’s also helpful for the less obvious groupings. We’ve modeled without channel, but what if we modeled without group instead?

Show the code
lm_channel <- lm(value ~ channel, data = group_channels_booted)
emmeans(lm_channel, ~ channel)
 channel    emmean    SE   df lower.CL upper.CL
 Email       87.48 0.314 9996    86.87    88.10
 No Channel  92.31 0.201 9996    91.92    92.71
 Search      97.86 0.446 9996    96.99    98.74
 Social     102.47 0.320 9996   101.84   103.10

Confidence level used: 0.95 
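Before interpreting, it’s worth quantifying how this channel-only model stacks up against the two-term model from earlier. Since both were fit to the same data, AIC gives a quick comparison (lower is better):

```r
# Lower AIC means a better fit/complexity trade-off;
# the two-term model should win comfortably here.
AIC(lm_channel, lm_both)
```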

That’s not great*. The adjusted r-squared is 0.103, and the values are off quite a ways from what we’d expected. Whereas if we go back to our model with both group and channel, we instead get:

*What is great is that we went from one dim to two and back to one, creating a chiasm.

Show the code
emmeans(lm_both, ~ channel)
 channel    emmean    SE   df lower.CL upper.CL
 Email       94.98 0.235 9994    94.52    95.44
 No Channel 100.00 0.164 9994    99.68   100.32
 Search     105.43 0.324 9994   104.80   106.07
 Social     110.41 0.241 9994   109.94   110.88

Results are averaged over the levels of: group 
Confidence level used: 0.95 

Just about perfect. The estimates are what we want and the standard errors are all smaller. It’s still too pristine, of course - in a real-world situation you’d have to check for interactions between group and channel, as well as bring in other potentially important factors (device type, geography, tenure, etc.) to get a better estimate. You’re also unlikely to have normally distributed data, though a log transformation can help with that. But knowing how you can take data and get good estimates will help you understand how your processes work and find ways to improve them.
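As a sketch of those next steps (hypothetical here, since our simulated data has no interaction baked in), both checks fit into the same formula interface:

```r
# group * channel expands to group + channel + group:channel,
# and log(value) models multiplicative rather than additive effects.
lm_next <- lm(log(value) ~ group * channel, data = group_channels_booted)
summary(lm_next)
```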