Community Detection

When examining a network graph, one of the first things people look for is elements of the graph that comprise subgroups. These subgroups are often comprised of nodes that are highly connected relative to the rest of the graph. By noting the different subgroups, we can learn something about how the people in the network see themselves and in interact with one another.

While the “eye-test” allows us to visually demarcate different subgroups or “communities” within a network, community detection can be better deduced through the use of a series of algorithms. We will explore some examples of this below. First it will be necessary to load a few familiar packages: igraph for conducting network analysis, ggraph for presenting the graphs, dplyr for manipulating the data, patchwork for presenting side-by-side graphs, and ggforce for drawing polygons.

library(igraph)
library(ggraph)
library(dplyr)
library(patchwork)
library(ggforce)

Load network data

When I was younger, I used to pay attention to who my classmates were close friends with. Knowing who was in which clique seemed to offer insights into how people interacted with one another.

Let’s use Facebook data to identify friendship communities among students in France. The code below loads the data as an edgelist, along with attributes associated with each student. We will limit our analysis only to the Facebook friendship relationships.

schoolfriends_edgelist <- read.csv(
  "https://ona-book.org/data/schoolfriends_edgelist.csv")

schoolfriends_vertices <- read.csv(
  "https://ona-book.org/data/schoolfriends_vertices.csv"
)

schoolfriends_fb <- schoolfriends_edgelist |> 
  filter(type == "facebook")

# create undirected graph
fb <- graph_from_data_frame(
  d = schoolfriends_fb,
  vertices = schoolfriends_vertices,
  directed = F)

Visualize the relationships

Now let’s display the Facebook friendship relationships.

ggraph(fb, layout = "fr") + 
  geom_edge_link(color = "darkgrey") +
  geom_node_point(size = 2, color = "blue") +
  theme_void()

This graph shows that only a portion of the student are on Facebook. The isolate nodes on the periphery are not members of this social media platform. Therefore, let’s simply focus on the largest connected group within the graph.

Largest connected component

This can be done in several ways. One way is to identify the largest connected component in the graph. To do this, we can use the components command to identify various disconnected subgroups. The next step is to identify the vertices which constitute the largest connected subgroup. Once we have this list, we can extract this subgroup and display the network.

comps <- components(fb)
largest <- which.max(comps$csize)
fb1 <- induced_subgraph(fb,which(comps$membership == largest))

ggraph(fb1, layout = "fr") + 
  geom_edge_link(color = "darkgrey") +
  geom_node_point(size = 2, color = "blue") +
  theme_void()

Cliques

Subgroups tend to be identified on the basis of cohesion, which is determined by dense connections among actors within a network. There are several ways to identify these communities. The most extreme way to define a subgroup is by identifying cliques – maximally connected subcomponents of a graph. In other words, a clique is a set of actors in which every subgroup member is directly connected to everyone else in the group. This subgroup therefore represents a complete subgraph. Cliques are comprised of at least three vertices.

R can provide a list of all cliques in the Facebook network. But because this is a large and highly connected network, let’s use the head command to only ask for the first six cliques.

head(cliques(fb1))
## [[1]]
## + 1/156 vertex, named, from 8cad18f:
## [1] 272
## 
## [[2]]
## + 1/156 vertex, named, from 8cad18f:
## [1] 1218
## 
## [[3]]
## + 1/156 vertex, named, from 8cad18f:
## [1] 151
## 
## [[4]]
## + 1/156 vertex, named, from 8cad18f:
## [1] 520
## 
## [[5]]
## + 1/156 vertex, named, from 8cad18f:
## [1] 132
## 
## [[6]]
## + 1/156 vertex, named, from 8cad18f:
## [1] 364

Each item in the list indicates that there are 156 cliques within this network. That is not surprising given how many nodes and how dense this network is. One can use the “min=” option to examine a subset of the largest cliques. In this case, the min value is set to 14 nodes in a clique.

cliques(fb1,min=14)
## [[1]]
## + 14/156 vertices, named, from 8cad18f:
##  [1] 466  376  638  841  1423 1218 769  797  624  440  125  325  694  245 
## 
## [[2]]
## + 14/156 vertices, named, from 8cad18f:
##  [1] 466  376  638  841  1237 1423 1218 769  624  440  125  325  694  245 
## 
## [[3]]
## + 14/156 vertices, named, from 8cad18f:
##  [1] 466  376  638  841  1423 1218 769  797  440  125  325  694  245  525 
## 
## [[4]]
## + 14/156 vertices, named, from 8cad18f:
##  [1] 466  376  638  841  1423 1218 1067 797  624  440  125  325  694  245 
## 
## [[5]]
## + 14/156 vertices, named, from 8cad18f:
##  [1] 466  376  638  841  1237 1423 1218 1067 624  440  125  325  694  245 
## 
## [[6]]
## + 14/156 vertices, named, from 8cad18f:
##  [1] 466  376  841  1237 1423 1218 769  624  440  125  325  564  694  245

Here we can see a set of 6 sets of cliques the size of 14 in the graph. Notably, there is much overlap in these cliques, with many of the nodes appearing across multiple cliques. This suggests that there is a tightly bound group of students in a single segment of the network.

Coreness

Coreness provides another, slightly more relaxed way, to identify cohesive subgroups in social networks. It is more relaxed in that group members do not need to all be directly connected to one another as with a clique. Instead, group membership is defined by their degree value. Specifically, a k-core is a maximally connected subgroup in which each node is adjacent to at least k other nodes. Nodes with lower k values tend to be on the periphery of a graph, whereas nodes with higher k values constitute a dense core.

Below we can use the coreness function to obtain k-core estimates for each node. Then we can examine the frequency of k values.

core <- coreness(fb1)
table(core)
## core
##  1  3  4  5  6  7  8  9 10 11 12 13 14 15 
##  1  4  8  4  9  2  7  2  8  7 13 15 39 37

The result from the table show that k values range from 1 to 15. Note that most of the students in this network are closer to the high end of the distribution, with a k value of 14 representing the modal (most numerous) category. If a student has a k of 14, it does not mean that they have a degree of 14, but instead that they belong to a group of students where every member has a degree of at least 14.

Let’s examine how coreness is distributed across the graph. We assign the core estimate as a vertex attribute. When graphing the network, we can apply that value as a label for each vertex and a shade of red to indicate increasing core membership in the network.

V(fb1)$coreness <- core

ggraph(fb1, layout = "kk") + 
  geom_edge_link(color = "grey",
                 arrow = arrow(length = unit(0.2, "cm"))) +
  geom_node_point(aes(color = as.numeric(coreness)), size = 4) +
  geom_node_text(aes(label = coreness), size = 3) +
  scale_color_gradient(low = "lightpink", high = "red", 
                       name = "Coreness Level") +
  theme_void()

What can we say about the coreness structure? We clearly have two different cores in this graph, indicated by the k = 15 nodes on the left and the k = 14 nodes on the right. These represent two relatively distinct cores of interaction. The lighter pink, low k-value nodes tend to be on the exterior of the graph, indicating that they are relatively peripheral actors.

Modularity: Group attributes

Communities are often defined based on shared attributes among group members. In a classroom setting, the teacher might ask all girls to stand along one wall and all boys to stand along the opposite wall. In network analysis, we might be interested in whether those shared attributes help to define the relationships within the classroom. For example, imagine a classroom in which boys are only friends with other boys and girls are only friends with other girls. As such, gender largely defines the relationships of the classroom friendship network.

The concept of modularity refers to the extent that relationships in a network are defined by group membership. The hypothetical example above represents a high level of gender modularity because ties exist only among same gender members, with no cross-gender friendships. To be more precise, modularity is an estimate of density among in-group members relative to out-group members. A value of 1 reflects perfect modularity, in that all individuals connect to all other individuals from the same group and not to any out-group members. A value of -1 indicates that people only connect to out-group members. A value of 0 indicates a random mix of ties that is completely unrelated to group membership.

How modular is the Facebook network? Let’s examine two attributes: 1) student membership based on being in the same class at school and 2) student membership based on gender.

(class_mod <- modularity(fb1, as.factor(V(fb1)$class)))
## [1] 0.317128
(gender_mod <- modularity(fb1, as.factor(V(fb1)$gender)))
## [1] 0.06504432

For both attributes, the modularity scores are positive, which suggests a general tendency for Facebook friendships to form between students who are members of the same class and the same gender. But the value for class modularity is substantially higher that the value for gender modularity, which shows that classroom membership is a better determinant of friendship in this network than is gender.

To illustrate this difference, let’s present two graphs side-by-side, with the first showing node colors based on class differences and the second showing nodes based on gender differences. We can print the modularity estimates as part of the title to illustrate the distinctions between class and gender connectivity.

fbc <- ggraph(fb1, layout = "kk") +  
  geom_edge_link(color = "grey") +
  geom_node_point(aes(color = as.factor(class)), 
                  size = 2, show.legend = FALSE) +
  ggtitle(paste("Class (mod =", round(class_mod, 3), ")")) +
  theme_void()
fbg <- ggraph(fb1, layout = "kk") +  
  geom_edge_link(color = "grey") +
  geom_node_point(aes(color = as.factor(gender)), 
                  size = 2, show.legend = FALSE) +
  ggtitle(paste("Gender (mod =", round(gender_mod, 3), ")")) +
  theme_void()

fbc + fbg

Can you see why we get higher modularity values for class than for gender?

The upper right cluster in the first graph is largely composed of three distinct colors (classrooms), compared to a different set of three colors that constitute the lower left cluster. By contrast, the green and red nodes in the second graph appear to be more randomly distributed across the network space, which helps to explain why the gender modularity value is so low.

Community Detection

So far we have addressed modularity in terms of vertex attributes that define either assigned categories (classrooms) or personal attributes (gender). But one of the most exciting developments within social network analysis is the ability to define group membership on the basis of ties rather than node-level attributes.

Several different community detection algorithms have been established in order to accomplish this task. While there are many different types of algorithms, each with their own approach to assigning nodes to groups, they all share a general goal of identifying subgroups that maintain relatively high levels of group modularity.

Let’s consider three different modularity algorithms. First, the Louvain algorithm emphasizes modularity optimization. It assigns and reassigns nodes to different groups in order to identify a maximal modularity score. Second, edge betweenness identifies distinct groups by iteratively removing from the network edges that constitute bridges between different communities. Third, random walk estimates multiple random pathways between nodes across the network, clustering nodes into communities based on the extent to which they frequently appear as part of the same paths.

Below we estimate these three algorithms, apply the membership vector to the igraph object, and compare the modularity scores to the class and gender attributes.

### Louvain
louvain <- cluster_louvain(fb1)
V(fb1)$louvain <- membership(louvain)
louvain_mod <- modularity(fb1, V(fb1)$louvain)

### Edge betweenness
edgeb <- cluster_edge_betweenness(fb1)
V(fb1)$edgeb <- membership(edgeb)
edgeb_mod <- modularity(fb1, V(fb1)$edgeb)

### walktrap
walktrap <- cluster_walktrap(fb1)
V(fb1)$walktrap <- membership(walktrap)
walktrap_mod <- modularity(fb1, V(fb1)$walktrap)

## tabulating modularity scores
rbind(louvain_mod,edgeb_mod,walktrap_mod,class_mod,gender_mod)
##                    [,1]
## louvain_mod  0.52900140
## edgeb_mod    0.48008469
## walktrap_mod 0.51405348
## class_mod    0.31712801
## gender_mod   0.06504432

What can we tell from this comparison of modularity scores? First, it seems as though each algorithm does a better job of capturing community clustering than either of the attributes. The values for Louvain, edge betweenness, and walktrap are all substantially higher than the modularity scores for the attributes. Among the algorithms, edge betweenness performs slightly worse, which suggests that bridging ties may not be the best way to define communities in this network.

A visual inspection can help us to further differentiate the approaches. Below we present the three community detection groups plus the class membership groups.

## visualize the differences
g1 <- ggraph(fb1, layout = "kk") +  
  geom_edge_link(color = "grey") +
  geom_node_point(aes(color = as.factor(louvain)), 
                  size = 2, show.legend = FALSE) +
  ggtitle(paste("Louvain (mod =", round(louvain_mod, 3), ")")) +
  theme_void()
g2 <- ggraph(fb1, layout = "kk") +  
  geom_edge_link(color = "grey") +
  geom_node_point(aes(color = as.factor(edgeb)), 
                  size = 2, show.legend = FALSE) +
  ggtitle(paste("Edge Betweenness (mod =", round(edgeb_mod, 3), ")")) +
  theme_void()
g3 <- ggraph(fb1, layout = "kk") +  
  geom_edge_link(color = "grey") +
  geom_node_point(aes(color = as.factor(walktrap)), 
                  size = 2, show.legend = FALSE) +
  ggtitle(paste("Walktrap (mod =", round(walktrap_mod, 3), ")")) +
  theme_void()
g4 <- ggraph(fb1, layout = "kk") +  
  geom_edge_link(color = "grey") +
  geom_node_point(aes(color = as.factor(class)), 
                  size = 2, show.legend = FALSE) +
  ggtitle(paste("Class (mod =", round(class_mod, 3), ")")) +
  theme_void()

g1 + g2 + g3 + g4 

What can we tell from these four different groupings? First, while the classroom membership groups looked to be superior to the gender membership groups, the community detection groups better distinguish the groups across the two large clusters. Edge betweenness emphasizes the distinctiveness of the two large clusters (red on the right and two shades of green on the left), which makes sense due to the emphasis on bridging. Alternatively, Louvain and walktrap tend to capture more subtleties among the two large clusters.

Community Polygons

Finally, it can sometimes be useful to distinguish between network communities by displaying a polygon that wraps around the groups. Below we apply the polygons to the walktrap results by adding an extra geom_mark_hull layer to the earlier walktrap plot.

g3 + 
  geom_mark_hull(aes(x = x, y = y, group = walktrap, fill = as.factor(walktrap)), 
                 concavity = 5, expand = unit(2, "mm"), alpha = 0.25,
                 show.legend = FALSE)

This plot shows both the distinctiveness (see the pink and blue regions) and the overlap (see the yellow, green, and purple regions) between different communities in the graph.

Community detection algorithms are very useful for identifying subgroups within social networks. That said, there is not one best community detection algorithm to use for all analyses. Your decision depends on the type of network you are examining, the questions you are proposing, and the micro-interactional processes that are central to the formation and maintenance of network ties.