This tutorial provides an example of how to scrape data from GitHub profiles, specifically the “overview” tab that serves as a user’s main GitHub page. The script can be adapted for more extensive scraping of GitHub pages.
Let’s begin by loading the requisite packages.
library(rvest)
library(tidyverse)
library(xml2)
library(readr)
First, let’s scrape a single GitHub webpage. We can use Kieran Healy’s page as an example; he is a computational sociologist who maintains an active GitHub page.
The first step is to create an object containing the URL for his profile. We can then pass that URL to the read_html() function, which retrieves the page’s HTML and saves it as an object named “page”.
url <- print(paste("https://github.com/kjhealy"))
## [1] "https://github.com/kjhealy"
page <- read_html(url)
Now that we have saved the page, we can select pieces of data that are embedded within the HTML. This requires either (a) detailed knowledge of HTML (which I do not have) or (b) a tool that can identify the section of the HTML that contains the desired pieces of data. Let’s go with option (b).
Selector Gadget is a plug-in application for Google Chrome and other web browsers. The faintly highlighted circle at top right in the image below shows the plug-in. The bar at bottom right (“No valid path found.”) shows the place where we will gather our html information from.
Once Selector Gadget is…um, selected…we can click on the information that we want to gather. I will start by selecting Kieran’s name. In the image below, note that Kieran’s name is highlighted in green (that’s where I clicked). Also, we now have a keyword in the text box (“.overflow-hidden”) which tells us where the name information is located. Next to that text box is a field that says “Clear (3)”. That means that the keyword points to three different pieces of information, not just to the name. The highlighted yellow portion of the webpage shows where the additional information is coming from.
We can further restrict the portion of information to select by clicking on the yellow section. In this case, the calendar now turns red (indicating that this segment is not part of the selected information), but several other (50+!) segments are now targeted.
Clicking on another yellow section (see below) restricts our focus to the single piece of information that we are interested in: the GitHub user’s name. With this selection, we can now copy the key piece of information we are interested in: “.d-block.overflow-hidden”.
Now that we have that CSS selector, we can pass it to the html_nodes() function.
name <- html_nodes(page, ".d-block.overflow-hidden") |>
  html_text()
name
## [1] "\n Kieran Healy\n "
Note that the extracted name contains unwanted characters. Let’s remove the line breaks and then trim the extra spaces. Much better!
name <- str_replace_all(name, "([\n])", "")
name
## [1] " Kieran Healy "
name <- str_trim(name, side = c("both"))
name
## [1] "Kieran Healy"
Now let’s gather additional information. Below I’ve identified the paths to the number of repositories and the number of stars the user has received. Note how I use the parse_number() function to keep only the numeric part of the selected text.
repos <- html_nodes(page, ".selected+ .js-selected-navigation-item") |>
  html_text()
repos
## [1] "\n \n Repositories\n 204\n" "\n \n Repositories\n 204\n"
repos <- parse_number(repos[1])
repos
## [1] 204
stars <- html_nodes(page, ".js-selected-navigation-item:nth-child(5)") |>
  html_text()
stars
## [1] "\n \n Stars\n 470\n" "\n \n Stars\n 470\n"
stars <- parse_number(stars[1])
stars
## [1] 470
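For what it’s worth, each of these extractions can also be written as a single pipeline rather than a series of intermediate assignments. A minimal sketch for the stars variable, using the same selector as above (first() comes from dplyr):
stars <- page |>
  html_nodes(".js-selected-navigation-item:nth-child(5)") |>
  html_text() |>
  first() |>        # keep only the first of the duplicate matches
  parse_number()    # strip everything but the number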
The left side of the webpage contains other useful information about the user. I used Selector Gadget to find the paths for the personal bio, the numbers of followers and accounts followed, the employer, the location, and the website. Below you can see the data extraction and cleaning.
Note that the “follow” information is especially tricky because both components (followers and following) are embedded in a single object that needs to be split and extracted separately.
quote <- html_nodes(page, ".js-user-profile-bio div") |>
  html_text()
quote
## [1] "Sociology and other distractions."
quote <- str_replace_all(quote, "([\n])", "")
follow <- html_nodes(page, ".mt-md-0 .mb-3") |>
  html_text()
follow
## [1] "\n \n 1.1k\n followers\n · \n 25\n following\n \n "
follow <- unlist(str_split(follow, "\n"))
followers <- str_trim(follow[3], side = c("both"))
following <- str_trim(follow[6], side = c("both"))
followers
## [1] "1.1k"
following
## [1] "25"
employer <- html_nodes(page, ".hide-md:nth-child(1)") |>
  html_text()
employer
## [1] "Duke University\n"
employer <- str_replace_all(employer, "([\n])", "")
employer
## [1] "Duke University"
location <- html_nodes(page, ".hide-md+ .hide-md") |>
  html_text()
location
## [1] "NC\n"
location <- str_replace_all(location, "([\n])", "")
location
## [1] "NC"
website <- html_nodes(page, ".hide-md~ .hide-md+ .pt-1") |>
  html_text()
website
## [1] "http://www.kieranhealy.org\n"
website <- str_replace_all(website, "([\n])", "")
website
## [1] "http://www.kieranhealy.org"
Now that we have gathered these data, they can be combined into a single data frame.
gtest <- as.data.frame(cbind(name, repos, stars, quote, followers, following,
                             employer, location, website, url))
glimpse(gtest)
## Rows: 1
## Columns: 10
## $ name <chr> "Kieran Healy"
## $ repos <chr> "204"
## $ stars <chr> "470"
## $ quote <chr> "Sociology and other distractions."
## $ followers <chr> "1.1k"
## $ following <chr> "25"
## $ employer <chr> "Duke University"
## $ location <chr> "NC"
## $ website <chr> "http://www.kieranhealy.org"
## $ url <chr> "https://github.com/kjhealy"
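One caveat about this step: wrapping cbind() in as.data.frame() coerces every column to character, which is why repos and stars appear as <chr> in the output above. If you would rather keep the numeric columns numeric, the data frame can be built directly instead; a minimal sketch using tibble() (loaded with the tidyverse):
# building the row directly preserves the column types (repos and stars stay numeric)
gtest <- tibble(name, repos, stars, quote, followers, following,
                employer, location, website, url)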
Because the URLs for GitHub overview pages follow a stable, predictable pattern, it is possible to automate the scraping of multiple pages. Given a set of GitHub user names, we can loop through them and scrape each overview page in turn.
To begin, we need to create an empty data frame to hold the results.
overview <- as.data.frame(NULL)
Next let’s use a broader list of GitHub users. In addition to Dr. Healy, let’s examine the webpages of two other esteemed computational sociologists: Scott Duxbury and Omar Lizardo. Then let’s generate a numeric index vector running from one to the number of people in the list.
glist <- c("kjhealy","sduxbury","olizardo")
# create an index so the loop can be run separately for each row in the df
r <- 1:length(glist)
Now it is time to construct the loop. We’ll iterate over the index vector “r”, with “i” as the loop counter, so the loop runs three times: first for Healy, then Duxbury, then Lizardo. The sections below extract the information for each individual variable. The “NA” assignment for each variable ensures that a missing value is recorded when information is absent. The penultimate step in the loop combines the variables into a single row, and the final step appends that row to the dataset.
for (i in r) {
  url <- print(paste("https://github.com/", glist[i], sep = ""))
  page <- read_html(url)
  # name
  name <- html_nodes(page, ".d-block.overflow-hidden") |>
    html_text()
  name <- str_trim(name, side = c("both"))
  name[length(name) == 0] <- NA # ensures that missing info still gets a column
  # repos
  repos <- html_nodes(page, ".selected+ .js-selected-navigation-item") |>
    html_text()
  repos <- parse_number(repos[1])
  repos[length(repos) == 0] <- NA
  # stars
  stars <- html_nodes(page, ".js-selected-navigation-item:nth-child(5)") |>
    html_text()
  stars <- parse_number(stars[1])
  stars[length(stars) == 0] <- NA
  # quote
  quote <- html_nodes(page, ".js-user-profile-bio div") |>
    html_text()
  quote <- str_replace_all(quote, "([\n])", "")
  quote[length(quote) == 0] <- NA
  # follow
  follow <- html_nodes(page, ".mt-md-0 .mb-3") |>
    html_text()
  follow <- unlist(str_split(follow, "\n"))
  followers <- str_trim(follow[3], side = c("both"))
  following <- str_trim(follow[6], side = c("both"))
  followers[length(followers) == 0] <- NA
  following[length(following) == 0] <- NA
  # employer
  employer <- html_nodes(page, ".hide-md:nth-child(1)") |>
    html_text()
  employer <- str_replace_all(employer, "([\n])", "")
  employer[length(employer) == 0] <- NA
  # location
  location <- html_nodes(page, ".hide-md+ .hide-md") |>
    html_text()
  location <- str_replace_all(location, "([\n])", "")
  location[length(location) == 0] <- NA
  # website
  website <- html_nodes(page, ".hide-md~ .hide-md+ .pt-1") |>
    html_text()
  website <- str_replace_all(website, "([\n])", "")
  website[length(website) == 0] <- NA
  # bind the data and append to the data frame
  tempinfo <- as.data.frame(cbind(i, name, repos, stars, quote, followers,
                                  following, employer, location, website, url))
  overview <- rbind(overview, tempinfo)
}
## [1] "https://github.com/kjhealy"
## [1] "https://github.com/sduxbury"
## [1] "https://github.com/olizardo"
Looking at the results, we can see that the requisite information from each page has been added to the data frame.
glimpse(overview)
## Rows: 3
## Columns: 11
## $ i <chr> "1", "2", "3"
## $ name <chr> "Kieran Healy", "Scott Duxbury", "Omar Lizardo"
## $ repos <chr> "204", "8", "19"
## $ stars <chr> "470", "2", "3"
## $ quote <chr> "Sociology and other distractions.", "Assistant Professor of…
## $ followers <chr> "1.1k", "26", "7"
## $ following <chr> "25", "2", "1"
## $ employer <chr> "Duke University", "UNC--Chapel Hill", "University of Califo…
## $ location <chr> "NC", NA, "Los Angeles, CA"
## $ website <chr> "http://www.kieranhealy.org", NA, "http://olizardo.bol.ucla.…
## $ url <chr> "https://github.com/kjhealy", "https://github.com/sduxbury",…
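As the list of users grows, a couple of refinements are worth considering: growing the data frame with rbind() inside a loop becomes slow for long lists, and it is polite to pause between requests so the server is not hammered. One alternative is to wrap the scraping steps in a function and let purrr::map_dfr() handle the row-binding. The sketch below is my own illustration, not part of the original script; for brevity it extracts only a couple of fields and omits the missing-value guards used in the loop above.
# hypothetical wrapper: scrape one overview page and return a one-row data frame
scrape_overview <- function(user) {
  url  <- paste0("https://github.com/", user)
  page <- read_html(url)
  name <- page |>
    html_nodes(".d-block.overflow-hidden") |>
    html_text() |>
    str_squish()
  repos <- page |>
    html_nodes(".selected+ .js-selected-navigation-item") |>
    html_text() |>
    first() |>
    parse_number()
  Sys.sleep(1)   # pause between requests to avoid overloading the server
  tibble(user = user, name = name, repos = repos, url = url)
}
# map_dfr() calls the function once per user and binds the resulting rows together
overview2 <- map_dfr(glist, scrape_overview)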
This approach can be used to gather additional information from the overview page or adapted to gather information from other web pages. Note that this example draws from publicly available information on the internet. Studying private or proprietary online information will require a more careful review of internet scraping protocols to ensure the consent and protection of human subjects.