Exponential Random Graph Models

The following provides some guidance for running and interpreting exponential random graph models (or ERGMs). ERGMs are network models that estimate the probability of tie formation. These models are useful because they allow researchers to account for dependencies in network data (rather than assuming independence of units). They use simulation techniques to develop inferential statistics, showing how an observed network is unique from what might be found by chance.

The first thing to do is to load the statnet package.

library(statnet)

Florentine marriage ties

Let’s use a network that is available as part of the statnet package. Florentine refers to network data constructed from historical documents on the relationships between powerful families in Renaissance Florence. We will focus on the marriage ties between families.

data(florentine)
flomarriage
##  Network attributes:
##   vertices = 16 
##   directed = FALSE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   bipartite = FALSE 
##   total edges= 20 
##     missing edges= 0 
##     non-missing edges= 20 
## 
##  Vertex attribute names: 
##     priorates totalties vertex.names wealth 
## 
## No edge attributes

The output shows that we have 16 vertices or nodes in our network, which represent 16 families. The total edges is 20, which means that we have 20 ties or lines between families based on marriage. This is a non-directed network (“directed = FALSE”) because ties are shared between vertices, rather than one node sending a tie to a receiver.

This network contains a set of vertex attributes. We’ll discuss those later. The edges have no attributes. Values for edges are either 1 (a marriage tie exists between families) or 0 (no marriage tie exists between families). These edge values serve with serve as the dependent variable for the ERGM analysis.

Let’s plot out the network to see what it looks like.

par(mar=c(0,0,0,0))
plot(flomarriage)

Modeling endogenous features

ERGMs allow researchers to ask two different types of questions: 1) how is the structure of ties in the observed network unique from other similar networks, and 2) how are the features of nodes associated with tie formation?

The first question is about “endogenous” features. These are the aspects of networks that can be used to define a network. A good example is density, which refers to the ratio of ties between nodes to all possible ties between nodes. Dense networks have lots of ties connecting the different actors, whereas ties are rare in less dense (or sparse) networks. An ERGM can tell us whether our network is more or less dense than other networks.

Let’s run a basic ERGM on the flomarriage network that models density, then we can talk through what the model tells us.

set.seed(162809)
flo1 <- ergm(flomarriage ~ edges) 
## Starting maximum pseudolikelihood estimation (MPLE):
## Evaluating the predictor and response matrix.
## Maximizing the pseudolikelihood.
## Finished MPLE.
## Stopping at the initial estimate.
## Evaluating log-likelihood at the estimate.
summary(flo1)
## Call:
## ergm(formula = flomarriage ~ edges)
## 
## Maximum Likelihood Results:
## 
##       Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges  -1.6094     0.2449      0  -6.571   <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 166.4  on 120  degrees of freedom
##  Residual Deviance: 108.1  on 119  degrees of freedom
##  
## AIC: 110.1  BIC: 112.9  (Smaller is better. MC Std. Err. = 0)

The first thing I did was to set a random seed. This sets a fixed starting point for the algorithm, ensuring that the results are the same each time we run this model.

The model sets the flomarriage network as the dependent variable. It’s technically estimating the logged of tie formation. The only parameter included in this model is “edges.” This allows us to estimate how dense the network is. The summary output reveals a negative and statistically significant coefficient for edges. This means that density in the flomarriage network is significantly lower than what one would expect to see at random.

How does the ERG procedure determine this estimate and significance? It simulates a large number of networks of the same size with random numbers of edges. These simulated networks generate a random distribution of density values. Then the model compares where the density of the flomarriage network fits on that distribution. In this case, the flomarriage density is quite a bit lower than the average density of the simulated networks.

Let’s add another endogenous feature.

flo2 <- ergm(flomarriage ~ edges +
               triangle) 
## Starting maximum pseudolikelihood estimation (MPLE):
## Evaluating the predictor and response matrix.
## Maximizing the pseudolikelihood.
## Finished MPLE.
## Starting Monte Carlo maximum likelihood estimation (MCMLE):
## Iteration 1 of at most 60:
## Optimizing with step length 1.0000.
## The log-likelihood improved by 0.0052.
## Convergence test p-value: < 0.0001. Converged with 99% confidence.
## Finished MCMLE.
## Evaluating log-likelihood at the estimate. Fitting the dyad-independent submodel...
## Bridging between the dyad-independent submodel and the full model...
## Setting up bridge sampling...
## Using 16 bridges: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 .
## Bridging finished.
## This model was fit using MCMC.  To examine model diagnostics and check
## for degeneracy, use the mcmc.diagnostics() function.
summary(flo2)
## Call:
## ergm(formula = flomarriage ~ edges + triangle)
## 
## Monte Carlo Maximum Likelihood Results:
## 
##          Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges     -1.6767     0.3628      0  -4.622   <1e-04 ***
## triangle   0.1583     0.6247      0   0.253      0.8    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 166.4  on 120  degrees of freedom
##  Residual Deviance: 108.1  on 118  degrees of freedom
##  
## AIC: 112.1  BIC: 117.6  (Smaller is better. MC Std. Err. = 0.00886)

Networks may also be defined by the extent to which they form triangles. This feature shows up a lot in naturally occurring networks due to a process called transitivity. If I am friends with Monique and I am also friends with Hector, the chances are that Monique and Hector will also become friends with one another. But marriage ties are different from friendship ties and here we can see that the tendency to form marriage triangles is not statistically significant.

For ERGMs to develop good fitting models, one should always include a density parameter. A variety of other endogenous features may be added as well. But note that each additional parameter adds further complexity to the model and can make hinder model convergence.

Exogenous features

Exogenous features refer to node level characteristics. Instead of telling us how common these features are in the network (as we saw with endogenous features), parameter estimates for exogenous features tell us about the extent to which node level characteristics are associated with tie formation.

When looking at the flomarriage network object, we saw several node level characteristics that might be associated with the formation of marriage ties. For example, we have a variable for the wealth of each family. We might assume that wealthy families have a tendency to form strategic marriage ties with other families. Let’s run an ERGM that tests this.

flo3 <- ergm(flomarriage ~ edges + 
               nodecov("wealth"))
## Starting maximum pseudolikelihood estimation (MPLE):
## Evaluating the predictor and response matrix.
## Maximizing the pseudolikelihood.
## Finished MPLE.
## Stopping at the initial estimate.
## Evaluating log-likelihood at the estimate.
summary(flo3)
## Call:
## ergm(formula = flomarriage ~ edges + nodecov("wealth"))
## 
## Maximum Likelihood Results:
## 
##                 Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges          -2.594929   0.536056      0  -4.841   <1e-04 ***
## nodecov.wealth  0.010546   0.004674      0   2.256   0.0241 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 166.4  on 120  degrees of freedom
##  Residual Deviance: 103.1  on 118  degrees of freedom
##  
## AIC: 107.1  BIC: 112.7  (Smaller is better. MC Std. Err. = 0)

The nodecov command allows one to model node covariates. The results show us that wealth is positively and significantly associated with tie formation. Because ERGMs approximate logistic regression models, the coefficients indicate the change in the logged odds of tie formation due to a one unit increase in the independent variable.

Another exogenous feature that can be examined is the (dis)similarity in the node level characteristics of two different actors. Similarity effects are essentially modeling the process of homophily or in other words how people who are similar to one another tend to form ties. As an example, we can examine the priorates variable, which measures the number of legislative seats controlled by each of the Florentine families.

flo4 <- ergm(flomarriage ~ edges + 
               nodecov("wealth") +
               absdiff("priorates"))
## Starting maximum pseudolikelihood estimation (MPLE):
## Evaluating the predictor and response matrix.
## Maximizing the pseudolikelihood.
## Finished MPLE.
## Stopping at the initial estimate.
## Evaluating log-likelihood at the estimate.
summary(flo4)
## Call:
## ergm(formula = flomarriage ~ edges + nodecov("wealth") + absdiff("priorates"))
## 
## Maximum Likelihood Results:
## 
##                    Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges             -2.529524   0.563818      0  -4.486   <1e-04 ***
## nodecov.wealth     0.011314   0.005145      0   2.199   0.0279 *  
## absdiff.priorates -0.004515   0.012130      0  -0.372   0.7097    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 166.4  on 120  degrees of freedom
##  Residual Deviance: 103.0  on 117  degrees of freedom
##  
## AIC: 109  BIC: 117.3  (Smaller is better. MC Std. Err. = 0)

The absdiff command estimates the absolute value of the difference between the number of priorates for each combination of actors in the network. Consequently, this parameter operates as an inverse estimate of homophily. The negative coefficient is in the general direction of homophily, implying that ties tend to form more often among families with similar numbers of priorates (or smaller gaps in priorate differences). However, this is not a statistically significant effect, so it fails to confirm our hypothesis about homophily.

Goodness of fit

Remember that ERG models are designed to simulate a large number of networks. Those networks should generate a stable distribution of parameter estimates for each of the covariates in the model. Those distributions provide the basis for making statistical inferences about the extent to which the observed estimates deviate from the norm. Model fit depends mainly on the extent to which the ERG procedure has effectively produced stable parameter distributions.

To assess goodness of fit for the previous model, let’s run the model again, but this time I’ll add the control option to force the running of Monte Carlo simulation chains. Then we can assess how well those simulations worked.

flo4 <- ergm(flomarriage ~ edges + 
               nodecov("wealth") +
               absdiff("priorates"),
               control = control.ergm(force.main = T))
## Starting maximum pseudolikelihood estimation (MPLE):
## Evaluating the predictor and response matrix.
## Maximizing the pseudolikelihood.
## Finished MPLE.
## Starting Monte Carlo maximum likelihood estimation (MCMLE):
## Iteration 1 of at most 60:
## Optimizing with step length 1.0000.
## The log-likelihood improved by 0.0043.
## Convergence test p-value: 0.0003. Converged with 99% confidence.
## Finished MCMLE.
## Evaluating log-likelihood at the estimate. Fitting the dyad-independent submodel...
## Bridging between the dyad-independent submodel and the full model...
## Setting up bridge sampling...
## Using 16 bridges: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 .
## Bridging finished.
## This model was fit using MCMC.  To examine model diagnostics and check
## for degeneracy, use the mcmc.diagnostics() function.
mcmc.diagnostics(flo4)
## Sample statistics summary:
## 
## Iterations = 49152:262144
## Thinning interval = 1024 
## Number of chains = 1 
## Sample size per chain = 209 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                       Mean      SD Naive SE Time-series SE
## edges              -0.1483   4.177   0.2889         0.3133
## nodecov.wealth    -31.0766 470.669  32.5568        32.4889
## absdiff.priorates -10.9952 166.253  11.5000        11.5000
## 
## 2. Quantiles for each variable:
## 
##                     2.5%  25% 50% 75% 97.5%
## edges               -8.0   -3   0   3   8.0
## nodecov.wealth    -879.0 -372 -58 257 890.0
## absdiff.priorates -334.8 -123 -37  89 321.2
## 
## 
## Are sample statistics significantly different from observed?
##                 edges nodecov.wealth absdiff.priorates Overall (Chi^2)
## diff.      -0.1483254    -31.0765550       -10.9952153              NA
## test stat. -0.4734816     -0.9565293        -0.9561078       1.7735557
## P-val.      0.6358696      0.3388049         0.3390178       0.6251953
## 
## Sample statistics cross-correlations:
##                       edges nodecov.wealth absdiff.priorates
## edges             1.0000000      0.8992568         0.8458602
## nodecov.wealth    0.8992568      1.0000000         0.8704899
## absdiff.priorates 0.8458602      0.8704899         1.0000000
## 
## Sample statistics auto-correlation:
## Chain 1 
##                edges nodecov.wealth absdiff.priorates
## Lag 0     1.00000000     1.00000000        1.00000000
## Lag 1024 -0.03761208    -0.03361205       -0.07239801
## Lag 2048  0.06592698     0.07485264        0.06751472
## Lag 3072  0.01957866     0.07935509        0.07215551
## Lag 4096  0.23107252     0.21017025        0.08353703
## Lag 5120 -0.10791521    -0.15346960       -0.02368803
## 
## Sample statistics burn-in diagnostic (Geweke):
## Chain 1 
## 
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5 
## 
##             edges    nodecov.wealth absdiff.priorates 
##           -0.4555           -0.9316           -1.4271 
## 
## Individual P-values (lower = worse):
##             edges    nodecov.wealth absdiff.priorates 
##         0.6487176         0.3515514         0.1535618 
## Joint P-value (lower = worse):  0.3231549 .

## 
## MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).

We can observe numerous diagnostics, but the key summary statistics are at the end, just before the “Sample Statistics” plots. The “Joint P-value” allows us to assess overall convergence and the “Individual P-values” provide information about convergence for specific parameter estimates. P-values range from 0-1, and low values (P < 0.1) are considered to be problematic because they imply that the simulations failed to produce a normal distribution of estimates.

In this case, we can see that the p-values all exceed 0.1, which is good, though the edges and wealth parameters are only slightly above that threshold. Therefore, it is a good idea to visually inspect the simulated values for those covariates.

Take a look at the “Sample statistics” plots. The left column shows the sequencing of estimates generated from the simulations. The horizontal line represents the mean value. What we like to see is a mix of estimates that hover around the mean, a little above and a little below the line. Poor fit is indicated by major deviations from the line on either side. All three estimates appear to match what we are looking for; some minor deviations, but nothing more.

Now look at the right column, which plots the distribution of estimates. In this case, we are looking for normal curves that center on a mean of zero. While these are not perfectly normal, they do provide decent approximations of normal distributions. Consequently, I think we can consider this model to provide good, if not great, fit to the observed data.

Improving fit

Are there ways to improve fit? Of course. The most obvious way is to tweak the models by adding, subtracting, or modifying the parameters. Another option is to provide a better starting point for the simulations – one that better facilitates model convergence. Remember that the simulations begin from a random starting point. If we have difficulty generating models that converge, we can save the final estimates from an initial model and then use those estimates as a starting point for a new ERG procedure. Here’s an example of how to do that.

coefs <- as.numeric(coef(flo4)) 

flo5 <- ergm(flomarriage ~ edges + 
               nodecov("wealth") +
               absdiff("priorates"),
             control = control.ergm(init = coefs,
                                    force.main = T))
## Starting Monte Carlo maximum likelihood estimation (MCMLE):
## Iteration 1 of at most 60:
## Optimizing with step length 1.0000.
## The log-likelihood improved by 0.0025.
## Convergence test p-value: < 0.0001. Converged with 99% confidence.
## Finished MCMLE.
## Evaluating log-likelihood at the estimate. Fitting the dyad-independent submodel...
## Bridging between the dyad-independent submodel and the full model...
## Setting up bridge sampling...
## Using 16 bridges: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 .
## Bridging finished.
## This model was fit using MCMC.  To examine model diagnostics and check
## for degeneracy, use the mcmc.diagnostics() function.
mcmc.diagnostics(flo5)
## Sample statistics summary:
## 
## Iterations = 8192:524288
## Thinning interval = 1024 
## Number of chains = 1 
## Sample size per chain = 505 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                      Mean      SD Naive SE Time-series SE
## edges             -0.1465   3.905   0.1738         0.1738
## nodecov.wealth    -5.3406 468.578  20.8515        20.8515
## absdiff.priorates -0.1901 153.068   6.8114         6.8114
## 
## 2. Quantiles for each variable:
## 
##                     2.5%  25% 50% 75% 97.5%
## edges               -8.0   -3   0   2   8.0
## nodecov.wealth    -874.8 -331 -18 311 909.0
## absdiff.priorates -268.4 -110  -2 101 311.8
## 
## 
## Are sample statistics significantly different from observed?
##                 edges nodecov.wealth absdiff.priorates Overall (Chi^2)
## diff.      -0.1465347     -5.3405941       -0.19009901              NA
## test stat. -0.8432861     -0.2561256       -0.02790885       2.3525981
## P-val.      0.3990685      0.7978539        0.97773485       0.5048452
## 
## Sample statistics cross-correlations:
##                       edges nodecov.wealth absdiff.priorates
## edges             1.0000000      0.8801681         0.8219418
## nodecov.wealth    0.8801681      1.0000000         0.8254837
## absdiff.priorates 0.8219418      0.8254837         1.0000000
## 
## Sample statistics auto-correlation:
## Chain 1 
##                 edges nodecov.wealth absdiff.priorates
## Lag 0     1.000000000   1.0000000000       1.000000000
## Lag 1024 -0.012432567   0.0528035445      -0.041965622
## Lag 2048  0.049464085   0.0084810316      -0.004959019
## Lag 3072  0.009831598   0.0003787971      -0.023702670
## Lag 4096 -0.033688802  -0.0068471546       0.002662492
## Lag 5120  0.006331811   0.0171607092      -0.032123390
## 
## Sample statistics burn-in diagnostic (Geweke):
## Chain 1 
## 
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5 
## 
##             edges    nodecov.wealth absdiff.priorates 
##            1.0310            0.9247            0.1978 
## 
## Individual P-values (lower = worse):
##             edges    nodecov.wealth absdiff.priorates 
##         0.3025365         0.3551074         0.8431666 
## Joint P-value (lower = worse):  0.5740615 .

## 
## MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).

This chunk of code saves the estimates from the previous model. I have added a new option to the control function (“init = coefs”) which tells R to use those previous estimates as the starting point for this model. When examining the diagnostics, notice that the p-values have increased substantially to the point where they are all well above the 0.1 threshold. This new model therefore appears to fit the data better.

There are other ways to improve model fit, such as adjusting the MCMC interval, burnin, and sample size. You should consult the ergm documentation for a detailed explanation of those options.

For a deeper dive, I can also recommend the statnet ergm tutorial.