While coefficients in OLS are fairly intuitive, it is often helpful to estimate average marginal effects, conditional marginal effects, and predicted values from regression models. This tutorial provides an overview of how to obtain and present these estimates.
The first thing to do is to load the necessary packages.
library(tidyverse) # always load the tidy tools
library(haven) # to read in Stata .dta files
library(marginaleffects) # package for estimating margins
Let’s read in some educational data for our example.
mlmdata <- read_dta("https://stats.idre.ucla.edu/stat/examples/imm/imm10.dta")
glimpse(mlmdata)
## Rows: 260
## Columns: 19
## $ schid <dbl> 7472, 7472, 7472, 7472, 7472, 7472, 7472, 7472, 7472, 7472, 7~
## $ stuid <dbl> 3, 8, 13, 17, 27, 28, 30, 36, 37, 42, 52, 53, 61, 64, 72, 83,~
## $ ses <dbl> -0.13, -0.39, -0.80, -0.72, -0.74, -0.58, -0.83, -0.51, -0.56~
## $ meanses <dbl> -0.4826087, -0.4826087, -0.4826087, -0.4826087, -0.4826087, -~
## $ homework <dbl> 1, 0, 0, 1, 2, 1, 5, 1, 1, 2, 1, 1, 1, 2, 1, 4, 1, 2, 1, 1, 1~
## $ white <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ parented <dbl> 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, 2, 3, 2, 1, 2, 3, 3, 1, 3, 3, 3~
## $ public <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ ratio <dbl> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 1~
## $ percmin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ math <dbl> 48, 48, 53, 42, 43, 57, 33, 64, 36, 56, 48, 48, 44, 35, 50, 3~
## $ sex <dbl> 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2~
## $ race <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
## $ sctype <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ cstr <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
## $ scsize <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3~
## $ urban <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
## $ region <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
## $ schnum <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
Next, create a dummy variable for female and estimate a model predicting mathematics scores.
mlmdata$female <- ifelse(mlmdata$sex == 2, 1, 0)
ols <- lm(math ~ homework + ses + female + white, mlmdata)
summary(ols)
##
## Call:
## lm(formula = math ~ homework + ses + female + white, data = mlmdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.9356 -5.8326 0.5589 5.2318 23.8706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.7198 1.4598 32.005 < 2e-16 ***
## homework 2.1654 0.3648 5.936 9.51e-09 ***
## ses 5.5351 0.6394 8.656 5.61e-16 ***
## female -0.2930 1.0254 -0.286 0.775
## white 1.0309 1.2741 0.809 0.419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.244 on 255 degrees of freedom
## Multiple R-squared: 0.4604, Adjusted R-squared: 0.4519
## F-statistic: 54.39 on 4 and 255 DF, p-value: < 2.2e-16
Homework and socio-economic status are positively and significantly associated with math scores. Gender and race differences in math scores are not statistically significant.
Average marginal effects (AMEs) are the average change in the outcome for a one-unit change in an independent variable, computed across all observations. In a basic OLS regression with no interactions and no curvilinear terms, the AMEs are identical to the B coefficients.
head(marginaleffects(ols))
## rowid type term dydx std.error math homework ses female white
## 1 1 response homework 2.16542 0.3648097 48 1 -0.13 1 1
## 2 2 response homework 2.16542 0.3648099 48 0 -0.39 0 1
## 3 3 response homework 2.16542 0.3648101 53 0 -0.80 0 1
## 4 4 response homework 2.16542 0.3648097 42 1 -0.72 0 1
## 5 5 response homework 2.16542 0.3648099 43 2 -0.74 1 1
## 6 6 response homework 2.16542 0.3648097 57 1 -0.58 1 1
The marginaleffects command generates a marginal effect estimate for each observation, conditional on that observation’s values on the other variables. Just the first six rows are shown here. Notice that the effect values for homework (dydx) are identical to its coefficient in the model, regardless of the covariate values (see the columns to the right).
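To see that equivalence directly, here is a minimal sketch that computes the AME of homework by hand as the average finite-difference slope across all observations (the step size h and the object names d0 and d1 are my own, for illustration):
# Manual AME of homework: average finite-difference slope across observations
h <- 0.001
d0 <- mlmdata; d0$homework <- d0$homework - h / 2
d1 <- mlmdata; d1$homework <- d1$homework + h / 2
mean((predict(ols, newdata = d1) - predict(ols, newdata = d0)) / h)
# recovers the homework coefficient (2.1654) exactly, because the model is linear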
So why are AMEs useful? Why not just rely on the coefficient estimates? Well, AMEs can be helpful for interpreting more complex models (generalized linear models, logit models, count models, etc.), an issue I will address in another tutorial. But a second reason, and the one that applies to this tutorial, is that AMEs are useful for evaluating conditional effects and interaction effects.
Let’s review conditional effects. First, I will use the plot_cme command to display the marginal effect of a single variable (SES) conditional on another variable (white).
plot_cme(ols, effect = "ses", condition = c("white"))
The above script produces a straight horizontal line, indicating that the marginal effect of SES does not vary across categories of race. This is because the model estimates independent effects for SES and race; that is, each variable’s effect is not contingent on the values of the others.
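We can verify this from the unit-level output as well: the estimated SES slopes are identical for white and non-white students (a quick check, assuming the variables argument of the package version used here; me_ses is my own object name):
me_ses <- marginaleffects(ols, variables = "ses")
tapply(me_ses$dydx, me_ses$white, mean)  # the same slope (~5.54) in both groups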
Now let’s see what happens to the margins when we add an interaction between SES and race to the original model.
ols2 <- lm(math ~ homework + ses + female + white + ses:white, mlmdata)
plot_cme(ols2, effect = "ses", condition = c("white"))
The new plot shows that the effects of SES are contingent on race. Those effects are stronger for white students (white = 1) than for non-white students (white = 0). An examination of the model estimates shows that the interaction effect (SES*white) is statistically significant.
summary(ols2)
##
## Call:
## lm(formula = math ~ homework + ses + female + white + ses:white,
## data = mlmdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.378 -5.556 0.102 5.108 22.677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.7582 1.5630 28.637 < 2e-16 ***
## homework 2.1021 0.3591 5.854 1.48e-08 ***
## ses 2.4107 1.1708 2.059 0.04050 *
## female -0.6990 1.0159 -0.688 0.49201
## white 3.1287 1.4170 2.208 0.02813 *
## ses:white 4.2171 1.3333 3.163 0.00175 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.102 on 254 degrees of freedom
## Multiple R-squared: 0.4808, Adjusted R-squared: 0.4706
## F-statistic: 47.05 on 5 and 254 DF, p-value: < 2.2e-16
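In a linear model, those conditional slopes can be recovered by hand from the coefficients: the marginal effect of SES is b_ses + b_ses:white × white. A quick sketch:
b <- coef(ols2)  # fitted coefficients from the interactive model
unname(b["ses"])                   # SES slope for non-white students: ~2.41
unname(b["ses"] + b["ses:white"])  # SES slope for white students: 2.41 + 4.22 = ~6.63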
In addition to estimating average and conditional marginal effects, we might also want to generate predicted values for various groups in the models. For this, I’ll start with the original model without the interaction term.
plot_cap(ols, condition = c("ses"))
Note that the scale of the y-axis is different. Instead of marginal effects (i.e., regression coefficient estimates, or B values), the plot_cap command plots predicted values of the dependent variable, with predicted math scores ranging from about 35 to 65. Now let’s add race as an additional condition.
plot_cap(ols, condition = c("ses", "white"))
Instead of a single line, we now get two separate lines, one each for white and non-white students. The two lines are parallel, with white students holding a constant advantage in math scores at every level of SES.
Remember that this is the non-interactive model, which means that we have modeled the effects of SES and race as independent of one another. Thus, the effect of SES is unaffected by race. The constant advantage of white students over non-white students is equal to the white coefficient from that model (1.0309).
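To confirm that constant gap, we can predict from the non-interactive model for two hypothetical students who differ only in race (nd is my own object, and the covariate values are arbitrary illustrations):
nd <- data.frame(homework = 2, ses = 0, female = 0, white = c(0, 1))
diff(predict(ols, newdata = nd))  # the gap equals the white coefficient, ~1.03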
Now let’s run plot_cap again, this time to examine the predictions from the interactive model.
plot_cap(ols2, condition = c("ses", "white"))
The inclusion of the statistical interaction allows the predicted values for SES to vary across race categories. The plot suggests that SES has a greater positive impact on math scores for white students than non-white students.
The error regions of the graph also show us where we can be especially confident of the differences in those effects. In this case, the predictions significantly diverge at around SES > .25, where the error regions separate.
This shows an interaction between a categorical and a continuous variable. How about plotting predictions from an interaction between two continuous variables?
ols3 <- lm(math ~ homework + ses + female + white + homework:ses, mlmdata)
plot_cap(ols3, condition = c("homework", "ses"))
The ols3 model estimates an interaction between SES and homework. The plot_cap command shows how the homework predictions vary across levels of SES, with SES summarized at five representative values and plotted as five separate lines.
The plot shows vertical distances between the lines (reflecting SES differences), but the slopes are very similar. This suggests that there is no statistical interaction between these two factors, which can be confirmed by reviewing the model statistics.
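For example, one quick way to check is to pull the interaction row out of the coefficient table (a minimal sketch using base R):
summary(ols3)$coefficients["homework:ses", ]  # estimate, std. error, t value, p value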
How about non-linear estimates from models? Below I add a quadratic term to the model to test whether there are threshold effects of homework. Then we can examine differences in the marginal effects and predictions.
ols4 <- lm(math ~ homework + I(homework^2) + ses + female + white, mlmdata)
plot_cme(ols4, effect = "homework", condition = c("homework"))
plot_cap(ols4, condition = c("homework"))
The addition of the squared term for homework allows the effect of homework on math scores to vary across its values. And the CME plot does show a modest increase in the homework effect on math scores as students devote more hours to homework.
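Concretely, with the squared term the marginal effect of homework is dy/dx = b1 + 2*b2*homework, where b1 and b2 are the coefficients on homework and homework^2. A short sketch evaluating that slope from the fitted coefficients (the homework range is illustrative):
b <- coef(ols4)
hw <- 0:7  # illustrative homework values
b["homework"] + 2 * b["I(homework^2)"] * hw  # slope grows with hours of homework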
The predictions reflect the increasing homework returns to math scores. But the upward curve is subtle, and inspection of the model estimates shows that…
summary(ols4)
##
## Call:
## lm(formula = math ~ homework + I(homework^2) + ses + female +
## white, data = mlmdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.3299 -5.8265 0.5769 5.2049 22.5578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.0629 1.7750 27.078 < 2e-16 ***
## homework 0.6778 1.1796 0.575 0.566
## I(homework^2) 0.2789 0.2103 1.326 0.186
## ses 5.5651 0.6389 8.711 3.94e-16 ***
## female -0.2622 1.0241 -0.256 0.798
## white 0.8175 1.2823 0.638 0.524
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.232 on 254 degrees of freedom
## Multiple R-squared: 0.4641, Adjusted R-squared: 0.4536
## F-statistic: 44 on 5 and 254 DF, p-value: < 2.2e-16
…neither the base term nor the quadratic term for homework is statistically significant. The more parsimonious model with a linear term alone is preferable.
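One direct way to check this is a nested-model F test comparing the linear and quadratic specifications (a quick sketch using base R’s anova):
anova(ols, ols4)  # tests whether adding I(homework^2) improves the fit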
We might want to generate predictions for very specific conditions and contrasts. For this we can turn to the predictions command, which creates a datagrid that we can use for a table or (better yet) prediction plotting. Here is an example from our earlier model that interacts SES and race.
predictions(ols2, newdata = datagrid(white = c(0,1),
ses = c(seq(-2,2,.5))))
## rowid type predicted std.error conf.low conf.high homework female
## 1 1 response 43.84540 1.7794447 40.34105 47.34974 2.023077 0.4923077
## 2 2 response 38.53992 1.6601588 35.27049 41.80935 2.023077 0.4923077
## 3 3 response 45.05075 1.3257450 42.43990 47.66160 2.023077 0.4923077
## 4 4 response 41.85382 1.3310918 39.23244 44.47521 2.023077 0.4923077
## 5 5 response 46.25610 1.0169114 44.25345 48.25875 2.023077 0.4923077
## 6 6 response 45.16773 1.0220658 43.15492 47.18053 2.023077 0.4923077
## 7 7 response 47.46145 0.9979733 45.49610 49.42681 2.023077 0.4923077
## 8 8 response 48.48163 0.7580017 46.98886 49.97440 2.023077 0.4923077
## 9 9 response 48.66681 1.2818444 46.14241 51.19120 2.023077 0.4923077
## 10 10 response 51.79554 0.6013732 50.61122 52.97985 2.023077 0.4923077
## 11 11 response 49.87216 1.7250021 46.47503 53.26929 2.023077 0.4923077
## 12 12 response 55.10944 0.6370776 53.85481 56.36407 2.023077 0.4923077
## 13 13 response 51.07751 2.2346078 46.67679 55.47823 2.023077 0.4923077
## 14 14 response 58.42334 0.8409644 56.76719 60.07949 2.023077 0.4923077
## 15 15 response 52.28286 2.7742827 46.81934 57.74639 2.023077 0.4923077
## 16 16 response 61.73725 1.1250374 59.52166 63.95284 2.023077 0.4923077
## 17 17 response 53.48822 3.3294368 46.93140 60.04503 2.023077 0.4923077
## 18 18 response 65.05115 1.4426824 62.21001 67.89229 2.023077 0.4923077
## white ses
## 1 0 -2.0
## 2 1 -2.0
## 3 0 -1.5
## 4 1 -1.5
## 5 0 -1.0
## 6 1 -1.0
## 7 0 -0.5
## 8 1 -0.5
## 9 0 0.0
## 10 1 0.0
## 11 0 0.5
## 12 1 0.5
## 13 0 1.0
## 14 1 1.0
## 15 0 1.5
## 16 1 1.5
## 17 0 2.0
## 18 1 2.0
Here I asked for predictions for white and non-white students. I also asked for specific values of SES – a sequence of values from -2 to 2 with an interval of 0.5. Now let’s plot these predictions using ggplot.
predictions(ols2, newdata = datagrid(white = c(0,1),
ses = c(seq(-2,2,.5)))) %>%
ggplot(aes(x = ses, y = predicted, ymin = conf.low, ymax = conf.high)) +
geom_ribbon(aes(fill = factor(white)), alpha = .2) +
geom_line(aes(color = factor(white)))
A few notes on the code. I started with the same predictions command from above, then used the pipe (%>%) to pass the output to ggplot. The aes mapping sets the y-axis to predicted, the column of predicted values in the output table. The conf.low and conf.high values supply the error regions, plotted with geom_ribbon, where alpha = .2 sets their transparency. The white variable must be converted to a factor to be used in the ribbon and line aesthetics.
The result is very similar to what we achieved previously with the plot_cap command. But the predictions command provides more flexibility in setting the prediction conditions, as the sketch below illustrates.
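For example, we could hold SES at fixed values while also varying homework, something the condition argument of plot_cap does not directly support (the values below are arbitrary illustrations):
predictions(ols2, newdata = datagrid(white = c(0, 1),
                                     ses = c(-1, 1),
                                     homework = c(0, 4)))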
What if instead of error regions we wanted to estimate error bars? This could be useful when we have an x variable that is categorical instead of continuous. Here is how I did this for our SES*white interaction.
ols2_pred <- predictions(ols2, newdata = datagrid(white = c(0,1),
ses = c(seq(-2,2,.5))))
p <- ggplot(data = ols2_pred, aes(x = ses,
y = predicted,
color = factor(white),
ymin = conf.low,
ymax = conf.high))
p + geom_pointrange() +
labs(x = "SES", y = "Math Score") +
ggtitle("Predicted Math Scores by SES and Race") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_discrete(name = "Race",
labels = c("Non-white","White"))
First I saved the predictions as an object (ols2_pred). Then I saved a base ggplot data statement (p). Finally, I generated a plot that builds on p, using geom_pointrange to plot the error bars, setting a title (ggtitle), centering the title (theme), and adjusting the legend (scale_color_discrete).
For more on marginaleffects: https://vincentarelbundock.github.io/marginaleffects/
Many other margins and prediction packages exist. For example: https://cran.r-project.org/web/packages/margins/vignettes/Introduction.html
I also recommend Kieran Healy’s chapter on modeling in his data visualization book: https://socviz.co/modeling.html