Generating predictions from OLS models

While coefficients in OLS are fairly intuitive, it is often helpful to estimate average marginal effects, conditional marginal effects, and predicted values from regression models. This tutorial provides an overview of how to obtain and present these estimates.

The first thing to do is to load the necessary packages.

library(tidyverse) # always load the tidy tools
library(haven) # to read in Stata .dta files
library(marginaleffects) # package for estimating margins

Let’s read in some educational data for our example.

mlmdata <- read_dta("https://stats.idre.ucla.edu/stat/examples/imm/imm10.dta")
glimpse(mlmdata)
## Rows: 260
## Columns: 19
## $ schid    <dbl> 7472, 7472, 7472, 7472, 7472, 7472, 7472, 7472, 7472, 7472, 7~
## $ stuid    <dbl> 3, 8, 13, 17, 27, 28, 30, 36, 37, 42, 52, 53, 61, 64, 72, 83,~
## $ ses      <dbl> -0.13, -0.39, -0.80, -0.72, -0.74, -0.58, -0.83, -0.51, -0.56~
## $ meanses  <dbl> -0.4826087, -0.4826087, -0.4826087, -0.4826087, -0.4826087, -~
## $ homework <dbl> 1, 0, 0, 1, 2, 1, 5, 1, 1, 2, 1, 1, 1, 2, 1, 4, 1, 2, 1, 1, 1~
## $ white    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ parented <dbl> 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, 2, 3, 2, 1, 2, 3, 3, 1, 3, 3, 3~
## $ public   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ ratio    <dbl> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 1~
## $ percmin  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ math     <dbl> 48, 48, 53, 42, 43, 57, 33, 64, 36, 56, 48, 48, 44, 35, 50, 3~
## $ sex      <dbl> 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2~
## $ race     <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
## $ sctype   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ cstr     <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
## $ scsize   <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3~
## $ urban    <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
## $ region   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
## $ schnum   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~

Create a dummy variable for female. Estimate a model predicting mathematics scores.

mlmdata$female <- ifelse(mlmdata$sex == 2, 1, 0)
ols <- lm(math ~ homework + ses + female + white, mlmdata)
summary(ols)
## 
## Call:
## lm(formula = math ~ homework + ses + female + white, data = mlmdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.9356  -5.8326   0.5589   5.2318  23.8706 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.7198     1.4598  32.005  < 2e-16 ***
## homework      2.1654     0.3648   5.936 9.51e-09 ***
## ses           5.5351     0.6394   8.656 5.61e-16 ***
## female       -0.2930     1.0254  -0.286    0.775    
## white         1.0309     1.2741   0.809    0.419    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.244 on 255 degrees of freedom
## Multiple R-squared:  0.4604, Adjusted R-squared:  0.4519 
## F-statistic: 54.39 on 4 and 255 DF,  p-value: < 2.2e-16

Homework and socio-economic status are positively and significantly associated with math scores. Gender and race differences in math scores are not statistically significant.

Average marginal effects (AMEs) are the marginal effects of an independent variable averaged across all observations: the mean change in the outcome for a one-unit change in that variable. In a basic OLS regression with no interactions and no curvilinear terms, the AMEs are identical to the B coefficients.

head(marginaleffects(ols))
##   rowid     type     term    dydx std.error math homework   ses female white
## 1     1 response homework 2.16542 0.3648097   48        1 -0.13      1     1
## 2     2 response homework 2.16542 0.3648099   48        0 -0.39      0     1
## 3     3 response homework 2.16542 0.3648101   53        0 -0.80      0     1
## 4     4 response homework 2.16542 0.3648097   42        1 -0.72      0     1
## 5     5 response homework 2.16542 0.3648099   43        2 -0.74      1     1
## 6     6 response homework 2.16542 0.3648097   57        1 -0.58      1     1

The marginaleffects command generates a marginal effect estimate for every observation, evaluated at that observation's values of the other variables. Just the first six rows are shown here. Notice that the effect values for homework (dydx) are identical to its coefficient in the model, regardless of the values of the other variables (see the columns to the right).
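To collapse these observation-level estimates into a single AME per predictor, you can summarize the returned object. This is a minimal sketch; in newer releases of marginaleffects, avg_slopes(ols) returns the same quantities.

mfx <- marginaleffects(ols)
# Average the observation-level estimates into one AME per term
summary(mfx)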

So why are AMEs useful? Why not just rely on the coefficient estimates? One reason is that AMEs are helpful for interpreting more complex models (generalized linear models such as logit and count models), an issue I will address in another tutorial. A second reason, and the one that applies to this tutorial, is that marginal effects are useful for evaluating conditional effects and interaction effects.

Conditional Marginal Effects

Let’s review conditional effects. First, I will use the plot_cme command to display the marginal effect of a single variable (SES), conditional on another variable (white).

plot_cme(ols, effect = "ses", condition = c("white"))

The above script produces a straight horizontal line, suggesting that the marginal effect of SES does not vary across values of race. This is because the model estimates independent, additive effects of SES and race: each variable's effect is not contingent on the value of the other.
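To see the same result numerically, you can request the SES marginal effect at each value of white; a minimal sketch (datagrid() holds the remaining covariates at typical values):

# Marginal effect of ses at white = 0 and white = 1;
# in this additive model both rows report the same dydx
marginaleffects(ols, variables = "ses",
                newdata = datagrid(white = c(0, 1)))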

Now let’s see what happens to the marginal effects if we add an interaction between SES and race to the original model.

ols2 <- lm(math ~ homework + ses + female + white + ses:white, mlmdata)
plot_cme(ols2, effect = "ses", condition = c("white"))

The new plot shows that the effect of SES is contingent on race: it is stronger for white students (white = 1) than for non-white students (white = 0). An examination of the model estimates confirms that the interaction term (ses:white) is statistically significant.

summary(ols2)
## 
## Call:
## lm(formula = math ~ homework + ses + female + white + ses:white, 
##     data = mlmdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.378  -5.556   0.102   5.108  22.677 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.7582     1.5630  28.637  < 2e-16 ***
## homework      2.1021     0.3591   5.854 1.48e-08 ***
## ses           2.4107     1.1708   2.059  0.04050 *  
## female       -0.6990     1.0159  -0.688  0.49201    
## white         3.1287     1.4170   2.208  0.02813 *  
## ses:white     4.2171     1.3333   3.163  0.00175 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.102 on 254 degrees of freedom
## Multiple R-squared:  0.4808, Adjusted R-squared:  0.4706 
## F-statistic: 47.05 on 5 and 254 DF,  p-value: < 2.2e-16

Predicted Values

In addition to estimating average and conditional marginal effects, we might also want to generate predicted values for various groups in the models. For this, I’ll start with the original model without the interaction term.

plot_cap(ols, condition = c("ses"))

Note that the scale of the y-axis is different. Instead of marginal effects (i.e., regression coefficient estimates or B values), the plot_cap command plots predicted values of the dependent variable, which here range from math scores of about 35 to 65. Now let’s add race as an additional condition.

plot_cap(ols, condition = c("ses", "white"))

Instead of a single line, we now get two lines, one for white and one for non-white students. Those two lines are parallel, with white students having a constant advantage in math scores at every level of SES.

Remember that this is the non-interactive model, which means that we have modeled the effects of SES and race to be independent of one another. Thus, the effect of SES is unaffected by race. The constant advantage of whites over non-whites is equal to the white coefficient from that model (1.0309).
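You can verify this directly by comparing predictions at white = 0 and white = 1 with the other covariates held fixed; the gap between the two rows reproduces that coefficient of about 1.03. A minimal check, not shown in the original output:

# Predictions for non-white and white students at the same covariate values;
# in the additive model the difference equals the white coefficient (~1.03)
predictions(ols, newdata = datagrid(white = c(0, 1), ses = 0))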

Now let’s run plot_cap again, this time to examine the predictions from the interactive model.

plot_cap(ols2, condition = c("ses", "white"))

The inclusion of the statistical interaction allows the predicted values for SES to vary across race categories. The plot suggests that SES has a greater positive impact on math scores for white students than non-white students.

The error regions of the graph also show us where we can be especially confident of the differences in those effects. In this case, the predictions significantly diverge at around SES > .25, where the error regions separate.
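One way to check this more formally is to compute the white vs. non-white contrast at a grid of SES values and see where its confidence interval excludes zero. A sketch using the package's comparisons function (availability may depend on your marginaleffects version):

# Contrast in predicted math scores between white and non-white students
# at several SES values; intervals excluding zero mark significant gaps
comparisons(ols2, variables = "white",
            newdata = datagrid(ses = seq(-2, 2, 0.5)))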

So far we have plotted predictions from an interaction between a categorical and a continuous variable. How about an interaction between two continuous variables?

ols3 <- lm(math ~ homework + ses + female + white + homework:ses, mlmdata)
plot_cap(ols3, condition = c("homework", "ses"))

The ols3 model estimates an interaction between SES and homework. The plot_cap command shows how the homework predictions vary across different levels of SES. The SES levels are transformed into quintiles and plotted as 5 separate lines.

The plot shows vertical distances between the lines (reflecting SES differences), but the slopes are nearly identical, indicating that there is no statistical interaction between these two factors. This can be confirmed by reviewing the model estimates, as shown below.
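For instance, the homework:ses coefficient can be inspected directly (output omitted here):

# The homework:ses term should be small and non-significant,
# consistent with the near-parallel lines in the plot
summary(ols3)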

How about non-linear estimates from models? Below I add a quadratic term to the model to estimate whether there are any threshold effects of homework. Then we can examine differences in the marginal effects and predictions.

ols4 <- lm(math ~ homework + I(homework^2) + ses + female + white , mlmdata)
plot_cme(ols4, effect = "homework", condition = c("homework"))

plot_cap(ols4, condition = c("homework"))

The addition of the squared term for homework allows the effect of homework on math scores to vary across its values, and the cme plot does show a modest increase in the homework effect as students devote more hours to homework.

The predictions reflect these increasing returns of homework to math scores. But the upward curve is subtle, and inspection of the model estimates shows that…

summary(ols4)
## 
## Call:
## lm(formula = math ~ homework + I(homework^2) + ses + female + 
##     white, data = mlmdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.3299  -5.8265   0.5769   5.2049  22.5578 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    48.0629     1.7750  27.078  < 2e-16 ***
## homework        0.6778     1.1796   0.575    0.566    
## I(homework^2)   0.2789     0.2103   1.326    0.186    
## ses             5.5651     0.6389   8.711 3.94e-16 ***
## female         -0.2622     1.0241  -0.256    0.798    
## white           0.8175     1.2823   0.638    0.524    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.232 on 254 degrees of freedom
## Multiple R-squared:  0.4641, Adjusted R-squared:  0.4536 
## F-statistic:    44 on 5 and 254 DF,  p-value: < 2.2e-16

…neither the base term nor the quadratic term for homework is statistically significant, so the simpler linear specification is preferable.
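A nested-model F test makes the same point; a sketch using base R, not part of the original output:

# Compare the linear specification (ols) to the quadratic one (ols4);
# a non-significant F statistic favors the simpler model
anova(ols, ols4)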

Prediction datagrids

We might want to generate predictions for very specific conditions and contrasts. For this we can turn to the predictions command, paired with datagrid, which builds a grid of covariate values that we can use for a table or (better yet) for plotting predictions. Here is an example from our earlier model that interacts SES and race.

predictions(ols2, newdata = datagrid(white = c(0,1), 
                                    ses = c(seq(-2,2,.5))))
##    rowid     type predicted std.error conf.low conf.high homework    female
## 1      1 response  43.84540 1.7794447 40.34105  47.34974 2.023077 0.4923077
## 2      2 response  38.53992 1.6601588 35.27049  41.80935 2.023077 0.4923077
## 3      3 response  45.05075 1.3257450 42.43990  47.66160 2.023077 0.4923077
## 4      4 response  41.85382 1.3310918 39.23244  44.47521 2.023077 0.4923077
## 5      5 response  46.25610 1.0169114 44.25345  48.25875 2.023077 0.4923077
## 6      6 response  45.16773 1.0220658 43.15492  47.18053 2.023077 0.4923077
## 7      7 response  47.46145 0.9979733 45.49610  49.42681 2.023077 0.4923077
## 8      8 response  48.48163 0.7580017 46.98886  49.97440 2.023077 0.4923077
## 9      9 response  48.66681 1.2818444 46.14241  51.19120 2.023077 0.4923077
## 10    10 response  51.79554 0.6013732 50.61122  52.97985 2.023077 0.4923077
## 11    11 response  49.87216 1.7250021 46.47503  53.26929 2.023077 0.4923077
## 12    12 response  55.10944 0.6370776 53.85481  56.36407 2.023077 0.4923077
## 13    13 response  51.07751 2.2346078 46.67679  55.47823 2.023077 0.4923077
## 14    14 response  58.42334 0.8409644 56.76719  60.07949 2.023077 0.4923077
## 15    15 response  52.28286 2.7742827 46.81934  57.74639 2.023077 0.4923077
## 16    16 response  61.73725 1.1250374 59.52166  63.95284 2.023077 0.4923077
## 17    17 response  53.48822 3.3294368 46.93140  60.04503 2.023077 0.4923077
## 18    18 response  65.05115 1.4426824 62.21001  67.89229 2.023077 0.4923077
##    white  ses
## 1      0 -2.0
## 2      1 -2.0
## 3      0 -1.5
## 4      1 -1.5
## 5      0 -1.0
## 6      1 -1.0
## 7      0 -0.5
## 8      1 -0.5
## 9      0  0.0
## 10     1  0.0
## 11     0  0.5
## 12     1  0.5
## 13     0  1.0
## 14     1  1.0
## 15     0  1.5
## 16     1  1.5
## 17     0  2.0
## 18     1  2.0

Here I asked for predictions for white and non-white students. I also asked for specific values of SES – a sequence of values from -2 to 2 with an interval of 0.5. Now let’s plot these predictions using ggplot.

predictions(ols2, newdata = datagrid(white = c(0,1), 
                                    ses = c(seq(-2,2,.5)))) %>%
  ggplot(aes(x = ses, y = predicted, ymin = conf.low, ymax = conf.high)) +
  geom_ribbon(aes(fill = factor(white)), alpha = .2) +
  geom_line(aes(color = factor(white)))

A few notes on the code. I started with the same predictions command from above, then used the pipe (%>%) to pass its output to ggplot. The aes() call maps the y-axis to predicted, the column of predicted values in the output. The conf.low and conf.high columns supply the error regions, which are plotted with geom_ribbon, and alpha = .2 sets their transparency. The white variable must be treated as a factor to be used in the ribbon and line aesthetics.

The resulting plot is very similar to what we produced earlier with the plot_cap command, but the predictions command provides more flexibility in setting the prediction conditions.
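For example, here is a hypothetical grid (not part of the original) that fixes SES at zero and varies homework within each race category:

# A more tailored grid: predictions by race and homework, with ses held at 0
predictions(ols2, newdata = datagrid(white = c(0, 1),
                                     homework = c(0, 2, 5),
                                     ses = 0))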

What if instead of error regions we wanted to estimate error bars? This could be useful when we have an x variable that is categorical instead of continuous. Here is how I did this for our SES*white interaction.

ols2_pred <- predictions(ols2, newdata = datagrid(white = c(0,1),
                                                ses = c(seq(-2,2,.5))))

p <- ggplot(data = ols2_pred, aes(x = ses,
                                  y = predicted, 
                                  color = factor(white),
                                  ymin = conf.low, 
                                  ymax = conf.high))

p + geom_pointrange() + 
  labs(x = "SES", y = "Math Score") +
  ggtitle("Predicted Math Scores by SES and Race") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_discrete(name = "Race", 
                       labels = c("Non-white","White"))

First I saved the predictions as an object (ols2_pred). Then I saved a base ggplot data statement (p). Finally, I generated a plot that builds on p, using geom_pointrange to draw the error bars, labeling the axes (labs), setting a title (ggtitle), centering the title (theme), and adjusting the legend (scale_color_discrete).

For more on marginaleffects: https://vincentarelbundock.github.io/marginaleffects/

Many other margins and prediction packages exist. For example: https://cran.r-project.org/web/packages/margins/vignettes/Introduction.html

I also recommend Kieran Healy’s chapter on modeling in his data visualization book: https://socviz.co/modeling.html