This tutorial shows how to generate data visualizations in R. We start with ggplot2 and then move on to visualizing text data.
Who doesn’t love penguins?
The Palmer Penguins dataset includes information about adult penguins near Palmer Station, Antarctica. With these data, we can answer interesting questions about the biological characteristics of different penguin species.
Let’s first load the necessary packages and the penguin data.
library(tidyverse)
library(ggthemes)
library(palmerpenguins)
penguins
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
Below I draw a scatterplot of the relationship between penguin flipper length and body mass. I also use color and point shape to show how these measurements vary by species.
penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(title = "Body mass and flipper length",
       subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
       x = "Flipper length (mm)", y = "Body mass (g)",
       color = "Species", shape = "Species") +
  scale_color_colorblind()
Here we can see that flipper length is positively associated with body mass. The graphic also shows that Gentoo penguins tend to be the largest. Their average body mass is 5076.02 grams, compared to 4201.75 grams for the full sample of penguins.
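The averages quoted above can be checked with a short summary. This is a minimal sketch: the object names species_means and overall_mean are my own, and missing values are dropped with na.rm = TRUE.

```r
library(dplyr)
library(palmerpenguins)

# Mean body mass for each species, ignoring missing measurements
species_means <- penguins |>
  group_by(species) |>
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))

# Mean body mass across the full sample
overall_mean <- mean(penguins$body_mass_g, na.rm = TRUE)

species_means
round(overall_mean, 2)
```

Comparing the Gentoo row of species_means with overall_mean gives the two figures mentioned in the text.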
Next I will generate a text-based visualization (a word cloud) of movie review data. These data are contained in the text2vec package. I will also use tidytext for data wrangling and the wordcloud package for the visualization. After loading the packages, I load the movie review data.
library(text2vec)
library(tidytext)
library(wordcloud)
data("movie_review")
The movie review data are separated into positive reviews (ratings of 7 and above) and negative reviews (ratings below 5), coded in the sentiment column as 1 and 0, respectively. I will split these into separate data frames.
movie_review_pos <- movie_review |>
  filter(sentiment == 1)
movie_review_neg <- movie_review |>
  filter(sentiment == 0)
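Before going further, it can help to confirm how many reviews fall into each group. The sketch below is a quick sanity check, assuming the sentiment column is coded 0 and 1 as in the filters above. The name sentiment_counts is only illustrative.

```r
library(dplyr)
library(text2vec)

data("movie_review")

# Number of reviews in each sentiment group (0 = negative, 1 = positive)
sentiment_counts <- movie_review |>
  count(sentiment)

sentiment_counts
```

If the two counts are very different, the raw word counts in the two word clouds are not directly comparable.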
Next I need to tokenize the reviews. This splits each review into individual words, one per row, and I then count how many times each word appears.
movie_words_pos <- movie_review_pos |>
  unnest_tokens(word, review) |>
  count(word, sort = TRUE)
movie_words_neg <- movie_review_neg |>
  unnest_tokens(word, review) |>
  count(word, sort = TRUE)
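To see what tokenizing does, here is a small sketch on two made-up reviews. The data frame toy_reviews and its contents are invented for illustration. By default, unnest_tokens() lowercases the text and drops punctuation.

```r
library(dplyr)
library(tidytext)

# Two invented reviews, one per row
toy_reviews <- tibble(
  id = 1:2,
  review = c("Great film, great cast!", "Bad film.")
)

# One row per word, then a count of how often each word occurs
toy_counts <- toy_reviews |>
  unnest_tokens(word, review) |>
  count(word, sort = TRUE)

toy_counts
```

The words "great" and "film" each appear twice across the two reviews, so they end up at the top of toy_counts.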
I also want to remove common words that carry little meaning (“stop words”) and tokens that begin with a digit.
data("stop_words")
movie_words_pos <- movie_words_pos |>
  anti_join(stop_words, by = "word") |>
  filter(!str_detect(word, "^[0-9]"))
movie_words_neg <- movie_words_neg |>
  anti_join(stop_words, by = "word") |>
  filter(!str_detect(word, "^[0-9]"))
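The two filtering steps can be tried on a handful of tokens. This is a sketch with invented values in toy_counts. Note that the pattern "^[0-9]" only matches tokens that start with a digit, so a token such as "se7en" is kept.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Invented word counts: a stop word, a content word, and three tokens with digits
toy_counts <- tibble(
  word = c("the", "cinematography", "10", "2nd", "se7en"),
  n = c(5L, 3L, 2L, 1L, 1L)
)

cleaned <- toy_counts |>
  anti_join(stop_words, by = "word") |>   # drop words found in the stop word list
  filter(!str_detect(word, "^[0-9]"))     # drop tokens that begin with a digit

cleaned
```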
Now let’s look at the top words for each set of reviews.
head(movie_words_pos)
## word n
## 1 br 10040
## 2 film 4151
## 3 movie 3818
## 4 time 1365
## 5 story 1329
## 6 people 899
head(movie_words_neg)
## word n
## 1 br 10784
## 2 movie 4667
## 3 film 3874
## 4 bad 1446
## 5 time 1205
## 6 story 1023
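Here we can see that the top three words in each list (br, film, and movie) are not helpful for inferring meaning from these reviews. The token br is left over from HTML line-break tags in the raw text, and film and movie are common to both sets of reviews. Therefore, I remove the top three rows from each dataset.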
movie_words_pos <- movie_words_pos[4:nrow(movie_words_pos),]
movie_words_neg <- movie_words_neg[4:nrow(movie_words_neg),]
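Dropping rows by position works here because the three unwanted words happen to be the top three in both lists. An alternative, sketched below, is to filter them out by name, which does not depend on their rank. The vector corpus_words and the data frame toy_counts are my own names, and the counts are copied from the positive-review output above.

```r
library(dplyr)

# A few rows copied from the positive-review counts shown above
toy_counts <- tibble(
  word = c("br", "film", "movie", "time", "story"),
  n = c(10040L, 4151L, 3818L, 1365L, 1329L)
)

# Words to drop, listed by name rather than by row position
corpus_words <- c("br", "film", "movie")

trimmed <- toy_counts |>
  filter(!word %in% corpus_words)

trimmed
```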
Time for the word cloud! I begin by deciding on a color palette for the word cloud.
pal <- brewer.pal(8,"Dark2")
And now for the plot. The code below generates word clouds for the two sets of reviews.
par(mar = c(0, 0, 2, 0))
movie_words_pos |>
  with(wordcloud(word, n, random.order = FALSE,
                 max.words = 50, colors = pal, scale = c(5, .5)))
title("Positive Movie Reviews", cex.main = 1.5)
movie_words_neg |>
  with(wordcloud(word, n, random.order = FALSE,
                 max.words = 50, colors = pal, scale = c(5, .5)))
title("Negative Movie Reviews", cex.main = 1.5)
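The placement of words in a word cloud is partly random, so calling set.seed() first makes the layout reproducible. The sketch below also puts two clouds side by side with par(mfrow = ...). It uses a small made-up data frame (toy_words), an arbitrary seed, and a temporary PDF file. These are assumptions for illustration and not part of the analysis above.

```r
library(wordcloud)  # also attaches RColorBrewer, which provides brewer.pal()

# Invented word counts for illustration
toy_words <- data.frame(
  word = c("time", "story", "life", "love"),
  n = c(40, 30, 20, 10)
)
pal <- brewer.pal(8, "Dark2")

# Draw into a temporary PDF so the example runs without an open plot window
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 8, height = 4)

# One row, two columns of plots, with room for a title above each
par(mfrow = c(1, 2), mar = c(0, 0, 2, 0))

set.seed(123)  # fix the random layout
wordcloud(toy_words$word, toy_words$n, random.order = FALSE,
          colors = pal, scale = c(3, .5))
title("Panel A", cex.main = 1.5)

set.seed(123)
wordcloud(toy_words$word, toy_words$n, random.order = FALSE,
          colors = pal, scale = c(3, .5))
title("Panel B", cex.main = 1.5)

dev.off()
```

The same pattern should carry over to movie_words_pos and movie_words_neg, although the scale argument may need adjusting so that the largest words fit in the narrower panels.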
The two word clouds show some interesting differences. [Add your own interpretation here.]