Linear regression is one of the single most powerful tools any analyst can wield. That power comes with a whole lot of assumptions you’ll need to be careful of, not only in how the data is distributed but in how you set up your model.
To illustrate this, we’ll create a dummy dataset. It doesn’t really matter what it represents, but it’s nice to keep things concrete, so we’ll say it’s monthly spend for an online retailer. We’ll create a segment with three classes - low, medium, and high - which add -15, 0, and 15 respectively to our value. We’ll also create three channels, “email”, “social”, and “search”, which add -5, 10, and 5 respectively, plus a “no channel” option, which adds 0.
We’ll generate values of around $100, with a standard deviation of $10 just to keep things interesting. Let’s take a quick look at our results.
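The data generation itself isn’t shown here, so below is a rough sketch of how a dataset like this could be built. The name group_channels_booted and the sample size of 10,000 match the models that follow, but the seed and the even sampling split are assumptions - the real dataset appears to resample groups and channels unevenly (hence “booted”), which this sketch doesn’t try to reproduce.

set.seed(42)  # arbitrary seed; the original isn't shown
n <- 10000

group_effects   <- c(Low = -15, Med = 0, High = 15)
channel_effects <- c(Email = -5, `No Channel` = 0, Search = 5, Social = 10)

group_channels_booted <- data.frame(
  group   = sample(names(group_effects), n, replace = TRUE),
  channel = sample(names(channel_effects), n, replace = TRUE)
)
group_channels_booted$value <- 100 +
  unname(group_effects[group_channels_booted$group]) +
  unname(channel_effects[group_channels_booted$channel]) +
  rnorm(n, mean = 0, sd = 10)

# A quick cross-tab of mean value by group and channel
with(group_channels_booted, round(tapply(value, list(group, channel), mean), 2))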
Our “no channel” column is exactly what we’d expect given how we’ve set up the groups. The other channels are harder to read, though - once the channel effects are layered on, we can’t tell at a glance what each group’s underlying value should be. Let’s run a linear regression model on value using group as our term.
lm_group <- lm(value ~ group, data = group_channels_booted)
summary(lm_group)
Call:
lm(formula = value ~ group, data = group_channels_booted)
Residuals:
Min 1Q Median 3Q Max
-40.067 -7.728 0.040 7.486 42.954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 116.3217 0.3565 326.28 <2e-16 ***
groupLow -29.7388 0.3845 -77.34 <2e-16 ***
groupMed -14.7191 0.4136 -35.59 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.26 on 9997 degrees of freedom
Multiple R-squared: 0.441, Adjusted R-squared: 0.4409
F-statistic: 3943 on 2 and 9997 DF, p-value: < 2.2e-16
There are a few things to note right away about the summary. Our “high” group isn’t listed out - it’s first alphabetically, so R uses it as the reference level automatically. The intercept is therefore the estimated mean for the high group, the estimate for “Low” of -$29.74 is in comparison to the high group, and the estimate of -$14.72 for the medium group is also relative to the high group.
*We could relevel the factors if we wanted to. This can be appropriate depending on your use case, but it’s not always necessary.
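As a minimal sketch of what that would look like - group_releveled is just an illustrative copy here, so the models in the rest of the post keep high as the reference:

# Make "Low" the baseline instead of the alphabetical default ("High")
group_releveled <- group_channels_booted
group_releveled$group <- relevel(factor(group_releveled$group), ref = "Low")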
This is fine if the intent is to build a predictive model. It wouldn’t be a great model, mind you, but in that case we’d mostly care about whether our terms were significant and what the adjusted r-squared looked like. Since we want to understand our data, though, we can use the emmeans package for display instead.
library(emmeans)

emmeans(lm_group, ~ group)
group emmean SE df lower.CL upper.CL
High 116.32 0.357 9997 115.6 117.02
Low 86.58 0.144 9997 86.3 86.87
Med 101.60 0.210 9997 101.2 102.01
Confidence level used: 0.95
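These tie straight back to the coefficient table above: the intercept is the high group’s estimated mean, and adding a group coefficient to it recovers that group’s mean.

# Reference (High) mean is the intercept; the other groups are intercept + coefficient
coef(lm_group)["(Intercept)"] + coef(lm_group)["groupLow"]  # ~86.58, the Low emmean
coef(lm_group)["(Intercept)"] + coef(lm_group)["groupMed"]  # ~101.60, the Med emmean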
This is better - our groups are showing roughly the values we’d expected. But, of course, we know that channel also plays a role, at least in our fictionalized data. What if we add that to the model?
lm_both <- lm(value ~ group + channel, data = group_channels_booted)
summary(lm_both)
Call:
lm(formula = value ~ group + channel, data = group_channels_booted)
Residuals:
Min 1Q Median 3Q Max
-35.014 -6.815 0.032 6.709 38.819
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 110.0050 0.3707 296.73 <2e-16 ***
groupLow -30.0387 0.3414 -87.99 <2e-16 ***
groupMed -15.0364 0.3672 -40.95 <2e-16 ***
channelNo Channel 5.0223 0.2629 19.10 <2e-16 ***
channelSearch 10.4527 0.3843 27.20 <2e-16 ***
channelSocial 15.4294 0.3163 48.78 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.998 on 9994 degrees of freedom
Multiple R-squared: 0.5596, Adjusted R-squared: 0.5594
F-statistic: 2540 on 5 and 9994 DF, p-value: < 2.2e-16
First, our model’s adjusted r-squared has jumped to 0.56, more than ten points above the group-only model’s 0.44, so we’re explaining more of the variation this way. The baseline is now the high group with email, and since email carries a penalty in this dataset the intercept has gone down. Since we’re looking at two dimensions now, we can also see which levels have the largest impact - so while the default summary has its limits, it does help us find opportunities for where to focus our time.
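If you’d rather pull that comparison programmatically than eyeball the two printouts, the fit statistics sit right on the model summaries:

# Adjusted r-squared: group-only model vs. group + channel model
summary(lm_group)$adj.r.squared  # ~0.44
summary(lm_both)$adj.r.squared   # ~0.56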
Now that we’ve said summary can be useful, let’s go ahead and use emmeans instead anyway.
emmeans(lm_both, ~ group)
group emmean SE df lower.CL upper.CL
High 117.73 0.322 9994 117.10 118.36
Low 87.69 0.142 9994 87.41 87.97
Med 102.69 0.196 9994 102.31 103.08
Results are averaged over the levels of: channel
Confidence level used: 0.95
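Before comparing, a quick check on where these numbers come from: each group’s emmean is the model’s prediction for that group averaged over the four channels (emmeans weights the levels equally by default), which we can rebuild from the coefficients above.

# High group emmean = intercept + average of the four channel effects
# (Email is the reference channel, so its effect is 0)
channel_effs <- c(0, coef(lm_both)[c("channelNo Channel", "channelSearch", "channelSocial")])
coef(lm_both)["(Intercept)"] + mean(channel_effs)  # ~117.73, the High emmean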
These are slightly different from the group-only estimates - they sit a touch further from the raw group means, since they’re now averaged over channel, but the standard errors are smaller. Let’s add channel to the view.
emmeans(lm_both, ~ group + channel)
group channel emmean SE df lower.CL upper.CL
High Email 110.00 0.371 9994 109.28 110.73
Low Email 79.97 0.236 9994 79.50 80.43
Med Email 94.97 0.272 9994 94.43 95.50
High No Channel 115.03 0.332 9994 114.38 115.68
Low No Channel 84.99 0.162 9994 84.67 85.31
Med No Channel 99.99 0.211 9994 99.58 100.40
High Search 120.46 0.434 9994 119.61 121.31
Low Search 90.42 0.324 9994 89.78 91.05
Med Search 105.42 0.351 9994 104.73 106.11
High Social 125.43 0.378 9994 124.69 126.18
Low Social 95.40 0.239 9994 94.93 95.86
Med Social 110.40 0.275 9994 109.86 110.94
Confidence level used: 0.95
This matches our cross-tab from earlier almost perfectly. Which raises the question of why we did all this in the first place*. Well, in this instance, we knew the answers because we created the data ourselves. In real-world situations, though, you’re often handed datasets where you don’t have a cheat sheet, and models do the heavy lifting of working out how the different dimensions influence your data.
*The real answer, though, is “content”.
It’s also helpful for the less obvious groupings. We’ve modeled without channel, but what if we modeled without group instead?
lm_channel <- lm(value ~ channel, data = group_channels_booted)
emmeans(lm_channel, ~ channel)
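The emmeans call doesn’t print fit statistics; the adjusted r-squared quoted below comes from the model summary, which you can pull directly:

summary(lm_channel)$adj.r.squared  # reported below as roughly 0.103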
That’s not great*. The adjusted r-squared is 0.103, and the values are off quite a ways from what we’d expected. Whereas if we go back to our model with both group and channel, we instead get:
*What is great is that we went from one dim to two and back to one, creating a chiasm.
emmeans(lm_both, ~ channel)
channel emmean SE df lower.CL upper.CL
Email 94.98 0.235 9994 94.52 95.44
No Channel 100.00 0.164 9994 99.68 100.32
Search 105.43 0.324 9994 104.80 106.07
Social 110.41 0.241 9994 109.94 110.88
Results are averaged over the levels of: group
Confidence level used: 0.95
Just about perfect. The estimates are what we want, and the standard errors are all smaller than in the channel-only model. It’s still too pristine, though - in a real-world situation you’d want to check for interactions between group and channel, and bring in other potentially important factors (device type, geography, tenure, etc.) to sharpen the estimates. You’re also unlikely to have normally distributed data, though a log transformation can help with that. But knowing how to take data like this and get good estimates will help you understand how your processes work and find ways to improve them.
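As a rough sketch of what those next steps could look like on this dataset - an interaction term plus a log-transformed response, with lm_interact as an illustrative name (whether either adjustment is warranted depends on your actual data):

# Let the channel effect vary by group, and model log(value) to handle
# right-skewed spend data (value needs to be strictly positive for this)
lm_interact <- lm(log(value) ~ group * channel, data = group_channels_booted)
summary(lm_interact)

# emmeans picks up the log transformation from the formula;
# type = "response" back-transforms the estimates to the dollar scale
emmeans(lm_interact, ~ group + channel, type = "response")

For our simulated data the interaction terms shouldn’t do much, since we built the effects to be purely additive, but on real data this is often where the interesting findings live.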