This page is designed to serve as a basic introduction to the R programming language. Let’s dive in!
R is a programming language used by more than 2 million people across the globe to conduct statistical analysis. RStudio is the graphical user interface (GUI) that makes R super easy to use.
For one, if you are taking my class, it’s required. But beyond that, I think there are many reasons to invest in learning R.
It’s free! R is open source software available to anyone for the low price of absolutely nothing. So if you change schools or jobs, you don’t need to worry about whether you will have access to R in your new role.
R is developed in continuous time rather than through bundled updates. What does that mean? Well, R is constantly being improved by users who develop new packages. Those packages are available to other users as soon as possible. By contrast, other (proprietary) stats programs delay access to new routines so that they can be incorporated into new release versions of their software. Bottom line: R often provides access to cutting edge stats tools (13,000+ R packages) before other software does.
R provides greater flexibility in conducting statistical analysis. Compared with menu-driven stats programs, it works at a lower level, which means you have more control over what you can do (as programming languages go, R is still a high-level language). Menu-driven stats programs have a fixed set of routines that limit what you can do as a statistician, although this can make it a bit easier to do simple things.
This flexibility is aided by “object-orientation”: the creation and manipulation of a wide variety of objects. Objects can be almost anything: a data matrix, a vector of numbers, or a single character. R allows you to generate and alter these various objects, which is different from standard stats packages that limit you to working within traditional data matrices. Object orientation in R makes it easier to work with a wider range of stats objects and to work across multiple related objects (e.g., relational databases).
When I first started using R (in the mid-aughts), it was powerful, but also difficult to use. It was like early programming in which you could create things, but lacked a way to easily visualize what you were creating or how those things fit together. Those were the “bare-R” days prior to the integration of RStudio, which has effectively removed those barriers to statistical programming.
RStudio provides four windows for simultaneous viewing of different elements of the programming process. The “console” is the input-output window. You can enter commands on the command line and see the generated output. The “script” window shows the files that we write our R commands in. It is preferable to generate these script files, rather than extensively typing commands into the console because it makes it easier to reproduce the work we have previously done.
Back in the day, in “bare-R”, these were the only two screens available to coders. To see how these two components work in conjunction with one another, let’s look at an example of running code in a script file and observing the output that shows up in the console. For this, we can ask R to solve simple math problems.
5 + 5
## [1] 10
3 * (5 + 5)
## [1] 30
The math problems generate numerical responses that show up in the console window. However, this code only asks for the solution to the math problems. Often, it is important to store this information so that we can use it later. Here’s an example.
x <- 2 * 8
x
## [1] 16
The “<-” portion of the code tells R to assign the math problem solution (16) to an object named x. Now, you can “call” on x whenever needed. You have created your first object!
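Once an object exists, you can use it in later calculations just like a number. Here is a small sketch of that idea. The object names `hours`, `rate`, and `pay` are made up for illustration and are not used elsewhere on this page.

```r
# store two values as objects (hypothetical names)
hours <- 2 * 8
rate <- 12.5

# use the stored objects in a new calculation and store the result
pay <- hours * rate

# typing the name of an object displays its contents
pay
## [1] 200
```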
The “environment” window in RStudio records all of the new objects created in each session. That way, you can easily scroll through and examine the various objects you created.
The fourth window (“plots”) includes several useful features. First, any graphical displays you generate will appear in the “Plots” tab. Second, by clicking on the “Packages” tab, you can view the various statistical packages and functions that are available to use.
R works by deploying various functions. A package is a group of functions. R has a set of base packages that are available as soon as you start up the program. You can also load additional packages to gain further functionality. Let’s begin by working with the base package.
One of the functions in the base package is the log function, which takes the natural logarithm of a number. Let’s see an example of this.
log(10)
## [1] 2.302585
To take a closer look at this function, you can type into the console (or run from the script file) a question mark before the name of the function: ?log. The information about the function will display in the “Help” tab of the plots window. This information will include details about the origins, alternative versions, and different options to include as part of the code.
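The help page for log describes a few of those options. As a quick sketch, here are some of the variations you will find there, using only functions from base R.

```r
# the natural logarithm (base e) is the default
log(10)

# use the base argument to ask for a different base
log(100, base = 10)   # returns 2

# shortcut functions exist for common bases
log10(1000)           # returns 3
log2(8)               # returns 3

# exp() reverses the natural logarithm
exp(log(10))
```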
We can combine functions with objects, such as the following which uses the x object we created previously (x = 16) to generate a log value, which is then recorded in a new object named z.
z <- log(x)
This process is foundational for work in R: generate objects and manipulate them with functions.
As noted above, you can view the objects you create in the environment window. They can be displayed in the console via the following command.
ls()
## [1] "x" "z"
To save an object for later use, you will want to first set your working directory via the setwd command. I set this based on my own working directory, so you will need to adjust the code to a folder on your computer.
Then you can use the save command to store objects x and z as a .rda file in your working directory. The .rda format is the standard R data format.
setwd("Q:/My Drive/Teaching/DSC295/scripts")
save(x, z, file="xz.rda")
Now let’s remove all objects from the list and then load the file we just saved.
Note that I have added comments to the line prior to each command to explain what the code is accomplishing. To add a comment, add a “#”, which tells R that anything on the line after the “#” is a comment and not a piece of the code.
# remove all objects from the environment
rm(list=ls())
# list all objects in the environment. Should be none since they have been removed
ls()
## character(0)
# load the file containing our previous objects
load("xz.rda")
# list again...now the objects have returned
ls()
## [1] "x" "z"
# display the content of object z
z
## [1] 2.772589
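If you want to try the save-and-load cycle without changing your working directory, here is a minimal sketch that writes to R's temporary folder instead. This is just one way to do it; the `path` and `loaded` object names are my own and are assumptions for this example.

```r
# recreate the two objects from earlier
x <- 2 * 8
z <- log(x)

# build a file path inside R's temporary folder for this session
path <- file.path(tempdir(), "xz.rda")

# save the objects, remove them, and load them back
save(x, z, file = path)
rm(x, z)
loaded <- load(path)

# load() returns the names of the objects it restored
loaded
## [1] "x" "z"
```

Files in the temporary folder are deleted when you close R, so use setwd and a real folder for anything you want to keep.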
The simplest object is a single piece of data, like a number or a character. A vector is a more complex object which contains multiple pieces of data. We can generate vectors by using the combine function, c(). Below we demonstrate this by generating a numeric vector and a character vector.
numbers <- c(1,2,3,4,5)
words <- c("Welcome","to","the","new","semester")
numbers
## [1] 1 2 3 4 5
words
## [1] "Welcome" "to" "the" "new" "semester"
In this class, we will work mostly with numeric vectors, which can be generated in various ways. For example…
# generate a sequence of numbers from 1 to 20
seq <- c(1:20)
seq
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# generate a reverse sequence from 20 to 1
seqrev <- c(20:1)
seqrev
## [1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
# generate a random number sequence of 100 numbers with an average of 50 and a standard deviation of 5
set.seed(8675309) # set a random number seed so the results are reproducible
seqrandom <- rnorm(100, mean = 50, sd = 5)
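The colon is not the only way to build a numeric vector. As a rough sketch, the base R functions seq() and rep() give you more control over the spacing and repetition of values. The object names `evens` and `zero_one` are hypothetical.

```r
# generate the even numbers from 2 to 20
evens <- seq(from = 2, to = 20, by = 2)
evens
## [1]  2  4  6  8 10 12 14 16 18 20

# count how many elements the vector contains
length(evens)
## [1] 10

# repeat the pattern 0, 1 three times
zero_one <- rep(c(0, 1), times = 3)
zero_one
## [1] 0 1 0 1 0 1
```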
There are many instances in which we might want to identify specific elements within a vector. To do this, we use brackets to “subset” the vector.
# show the 15th number in the random vector
seqrandom[15]
## [1] 45.02705
# show the 1st,4th, and 5th word in the character vector
words[c(1,4,5)]
## [1] "Welcome" "new" "semester"
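Brackets accept more than positions. Here is a short sketch of two other common ways to subset: a logical condition, which keeps the elements where the condition is true, and a negative number, which drops the element at that position.

```r
numbers <- c(1, 2, 3, 4, 5)
words <- c("Welcome", "to", "the", "new", "semester")

# keep only the numbers greater than 3
numbers[numbers > 3]
## [1] 4 5

# drop the first element with a negative index
numbers[-1]
## [1] 2 3 4 5

# keep the words that are longer than three letters
words[nchar(words) > 3]
## [1] "Welcome"  "semester"
```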
It is often essential to combine multiple vectors into a single object. For example, survey data contain a series of variables that are represented along the columns and a set of cases (respondents or people) along the rows. If we have those separate vectors, and those vectors are all of the same length, they can be combined into a matrix. Below I simulate how this is done via the cbind (or column bind) command.
# generate several vectors
ID <- c(1:100)
V1 <- rnorm(100, mean = 50, sd = 5) # random number distribution
V2 <- rbinom(100, size = 1, prob = .35) # binomial (0 or 1) distribution
V3 <- rep(c("Welcome","to","the","new","semester"), times = 20) # character vector that repeats 20 times to reach length 100
matrix <- cbind(ID,V1,V2,V3)
# display the first six lines of the matrix
head(matrix)
## ID V1 V2 V3
## [1,] "1" "53.7226883882519" "0" "Welcome"
## [2,] "2" "43.2668599597606" "0" "to"
## [3,] "3" "51.6507125500059" "0" "the"
## [4,] "4" "49.9363733690973" "0" "new"
## [5,] "5" "47.6816201985462" "1" "semester"
## [6,] "6" "51.0247104516438" "0" "Welcome"
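Notice the quotation marks around every value, including the numbers. A vector or matrix can hold only one type of data, so when numbers and text are combined, R converts everything to character. A tiny sketch of the same behavior, using a made-up vector named `mixed`:

```r
# combine two numbers and one piece of text
mixed <- c(1, 2.5, "three")

# the numbers are now stored as text
mixed
## [1] "1"     "2.5"   "three"

class(mixed)
## [1] "character"
```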
Matrices can sometimes be difficult to work with, especially if we are interested in examining the columns as variables. Therefore, we might want to take this object and convert it to a data frame. With a data frame, we can more easily call on and analyze the contents of a specific variable. See below.
# show the current class of our object
class(matrix)
## [1] "matrix" "array"
# convert the object into a data frame
df <- as.data.frame(matrix)
class(df)
## [1] "data.frame"
head(df)
## ID V1 V2 V3
## 1 1 53.7226883882519 0 Welcome
## 2 2 43.2668599597606 0 to
## 3 3 51.6507125500059 0 the
## 4 4 49.9363733690973 0 new
## 5 5 47.6816201985462 1 semester
## 6 6 51.0247104516438 0 Welcome
This is another common task in R. You will have an object that is in an incorrect format, which will need to be changed to a more appropriate format.
In fact, this conversion was only partially successful. A matrix can hold only one type of data, so when cbind combined the numeric vectors with a character vector, it converted everything to character, and the data frame inherited those character columns. We know that V1 and V2 are numeric, so let’s reassign them.
# note that these are character vectors, not numeric
class(df$V1)
## [1] "character"
class(df$V2)
## [1] "character"
# reassign the formatting
df$V1 <- as.numeric(df$V1)
df$V2 <- as.numeric(df$V2)
# this looks better!
class(df$V1)
## [1] "numeric"
class(df$V2)
## [1] "numeric"
# summarize the distributions for V1 and V2
summary(df$V1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.01 46.47 50.10 49.75 53.36 59.57
summary(df$V2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 0.35 1.00 1.00
Note that we accessed the variables embedded within the data frame by using the dollar sign ($).
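If you already have the separate vectors, you can skip the matrix step. The sketch below builds a data frame directly with the base R data.frame() function, which keeps each column's own type. The seed value and the name `df2` are arbitrary choices for this example, so the numbers will not match the output above.

```r
set.seed(2024)  # arbitrary seed, chosen only for this sketch

# generate the same kinds of vectors as before
ID <- 1:100
V1 <- rnorm(100, mean = 50, sd = 5)
V2 <- rbinom(100, size = 1, prob = .35)
V3 <- rep(c("Welcome", "to", "the", "new", "semester"), times = 20)

# combine them into a data frame without going through a matrix
df2 <- data.frame(ID, V1, V2, V3)

# check the class of every column at once
sapply(df2, class)

# the dollar sign works the same way
mean(df2$V1)
```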
We can easily generate plots in R to visualize features of the data we are working with. For example, let’s visualize the distribution of V1 in our data frame.
plot(density(df$V1))
Recall that we asked R to generate a random normal distribution of 100 numbers with a mean of 50 and a standard deviation of 5. The results look spot on! The variable has a mean of 49.75 and a standard deviation of 4.95.
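The code behind those two numbers is not shown above. One way to get them is with the base R functions mean() and sd(). The sketch below draws a fresh vector (the name `draws` and the seed are my own choices), so expect values near 50 and 5 rather than exactly 49.75 and 4.95.

```r
set.seed(123)  # arbitrary seed, chosen only for this sketch

# draw 100 random numbers with a mean of 50 and a standard deviation of 5
draws <- rnorm(100, mean = 50, sd = 5)

# compare the sample values with the values we asked for
mean(draws)
sd(draws)
```

With only 100 draws, the sample mean and standard deviation will wobble a bit around the targets. Try increasing 100 to 10000 and watch them move closer.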
So far we’ve been working with functions that are part of the base package. It is possible to also use functions from other packages. We just need to install those packages and bring them into our library.
Let’s install a package called “statnet” that is designed for conducting social network analysis.
# install.packages("statnet") # Once you install the package, you can comment this out, as I've done here.
library(statnet)
To take a closer look at the statnet package, type the following into the console: help(package = "statnet")
Many packages include data that can be used to work with their various functions. statnet includes a network dataset that is based on the relationships among prominent families in 15th-century Renaissance Florence. Below we can access the data.
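Instead of commenting the install line in and out by hand, you can ask R to install a package only when it is missing. Here is a minimal sketch. The helper name `ensure_package` is hypothetical (it is not part of R or statnet); it relies on the base R functions requireNamespace() and install.packages().

```r
# hypothetical helper: install a package only if it is not already installed
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # report (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# example with a package that ships with R, so nothing is downloaded
ensure_package("stats")
```

You would still call library() afterward to bring the package into your session.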
# access the data from the statnet package
data(florentine)
# show the features from one of the datasets embedded in the florentine data
flomarriage # this represents marriage ties between the families
## Network attributes:
## vertices = 16
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 20
## missing edges= 0
## non-missing edges= 20
##
## Vertex attribute names:
## priorates totalties vertex.names wealth
##
## No edge attributes
Now let’s generate a plot of the marriage ties using the gplot function from statnet.
# remove the margins from the plot area so the graph can be maximum size
par(mar=c(0,0,0,0)) # the number sequence goes bottom, left, top, and right margins
gplot(flomarriage,
gmode = "graph",
displaylabels=TRUE)
From your history courses, you might be familiar with one of these families. The Medici were the most powerful banking family in Europe, in large part because they were able to use marriage ties to become a central actor in the Florentine trading community.
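You can check that centrality claim against the data. A rough sketch, assuming statnet is installed: the degree function (from the sna package that statnet loads) counts the marriage ties for each family, and network.vertex.names supplies the family names. The object name `deg` is my own.

```r
library(statnet)
data(florentine)

# count the number of marriage ties for each family
deg <- degree(flomarriage, gmode = "graph")

# attach the family names to the counts
names(deg) <- network.vertex.names(flomarriage)

# list the families from most to fewest ties
sort(deg, decreasing = TRUE)
```

Degree is only one simple way to measure how central an actor is, but it is a good first look.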
Learning R can be frustrating, but there is a lot of help online. I recommend using ChatGPT’s R code Helper to assist you when you get stuck. Try it out. It’s a great tool!