This tutorial shows how to create data visualizations in R. We start with ggplot2 and then build a text-based visualization (a word cloud).

Palmer penguins data viz

Who doesn’t love penguins?

The Palmer Penguins dataset includes information about adult penguins near Palmer Station, Antarctica. With these data, we can answer interesting questions about the biological characteristics of different penguin species.

Let’s first load the necessary packages and the penguin data.

library(tidyverse)
library(ggthemes)
library(palmerpenguins)
penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

The scatterplot below shows the relationship between flipper length and body mass. Points are colored and shaped by species, and a single linear fit (geom_smooth(method = "lm")) is drawn across all penguins.

penguins |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(title = "Body mass and flipper length",
       subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
       x = "Flipper length (mm)", y = "Body mass (g)",
       color = "Species", shape = "Species") +
  scale_color_colorblind()

The plot shows that flipper length is positively associated with body mass. Gentoo penguins also tend to be the largest: their average body mass is 5076.02 grams, compared with 4201.75 grams for the full sample of penguins.
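The averages quoted above are not read off the plot itself. A minimal sketch of how they could be computed with a grouped summary follows; the object names mass_by_species and mass_overall are my own and not part of the original tutorial.

```r
library(dplyr)
library(palmerpenguins)

# Mean body mass per species; na.rm = TRUE skips penguins with missing measurements
mass_by_species <- penguins |>
  group_by(species) |>
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))

# Mean body mass for the full sample
mass_overall <- penguins |>
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))

mass_by_species
mass_overall
```

Without na.rm = TRUE both summaries would return NA, because a few penguins have no recorded body mass.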

Movie review word cloud

Next I will generate a text-based visualization (a word cloud) of movie review data. The data ship with the text2vec package; I use tidytext for the data wrangling and wordcloud for the visualization. After loading the packages, data("movie_review") loads the reviews into the session.

library(text2vec)
library(tidytext)
library(wordcloud)

data("movie_review")

The movie reviews are separated into positive (IMDB rating of 7 or higher, coded sentiment = 1) and negative (rating below 5, coded sentiment = 0). I will split these into separate data frames.

movie_review_pos <- movie_review |> 
  filter(sentiment == 1)

movie_review_neg <- movie_review |> 
  filter(sentiment == 0)
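Before splitting, it can help to check how many reviews fall into each class. The following is a small sketch of that check; sentiment_counts is a name I introduce here for illustration.

```r
library(dplyr)
library(text2vec)

data("movie_review")

# Number of reviews per sentiment label (0 = negative, 1 = positive)
sentiment_counts <- movie_review |>
  count(sentiment)

sentiment_counts
```

The two counts should match the number of rows in movie_review_neg and movie_review_pos, respectively.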

Next I need to tokenize the reviews, which puts one word on each row, and then count how many times each word occurs.

movie_words_pos <- movie_review_pos |> 
  unnest_tokens(word, review) |> 
  count(word, sort = TRUE)

movie_words_neg <- movie_review_neg |> 
  unnest_tokens(word, review) |> 
  count(word, sort = TRUE)
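To see what tokenizing and counting do, here is a toy sketch on two made-up reviews (toy_reviews and its contents are invented for illustration and are not from the movie_review data). unnest_tokens() lowercases the text and strips punctuation by default.

```r
library(dplyr)
library(tidytext)

# Two hypothetical reviews, invented for illustration
toy_reviews <- data.frame(
  id = 1:2,
  review = c("A great, great film!", "Not a great film.")
)

# One row per word, then one row per distinct word with its count
toy_counts <- toy_reviews |>
  unnest_tokens(word, review) |>
  count(word, sort = TRUE)

toy_counts
# "great" occurs 3 times, "a" and "film" twice each, "not" once
```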

I also want to remove uninformative words (“stop words”) and tokens that begin with a digit from these reviews.

data("stop_words")

# Drop stop words, then drop tokens that start with a digit
movie_words_pos <- movie_words_pos |>
  anti_join(stop_words, by = "word") |>
  filter(!str_detect(word, "^[0-9]"))

movie_words_neg <- movie_words_neg |>
  anti_join(stop_words, by = "word") |>
  filter(!str_detect(word, "^[0-9]"))
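The pattern "^[0-9]" only tests the first character of each token. A short sketch with made-up tokens (the vector below is hypothetical) shows which ones the filter would drop:

```r
library(stringr)

# Hypothetical tokens, invented for illustration
tokens <- c("10", "1950s", "se7en", "film")

# TRUE marks tokens that start with a digit and would be filtered out
starts_with_digit <- str_detect(tokens, "^[0-9]")
starts_with_digit
# TRUE TRUE FALSE FALSE
```

Tokens such as "se7en", which contain a digit but do not start with one, are kept.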

Now let’s look at the top words for each set of reviews.

head(movie_words_pos)
##     word     n
## 1     br 10040
## 2   film  4151
## 3  movie  3818
## 4   time  1365
## 5  story  1329
## 6 people   899
head(movie_words_neg)
##    word     n
## 1    br 10784
## 2 movie  4667
## 3  film  3874
## 4   bad  1446
## 5  time  1205
## 6 story  1023

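The top three words in each set (br, film, and movie) are not helpful for inferring meaning from these reviews: br is left over from HTML line-break tags in the raw text, and film and movie are common in both positive and negative reviews. Because both data frames are sorted by count, I drop the first three rows from each.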

movie_words_pos <- movie_words_pos[4:nrow(movie_words_pos),]
movie_words_neg <- movie_words_neg[4:nrow(movie_words_neg),]
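Dropping rows by position works here because the three unwanted words happen to occupy the first three rows. A hedged alternative is to remove them by name, which does not depend on the sort order. The sketch below uses a small made-up data frame (toy_words and custom_stop are names I introduce for illustration).

```r
library(dplyr)

# Hypothetical word counts, invented for illustration
toy_words <- data.frame(
  word = c("br", "film", "movie", "time"),
  n = c(10, 5, 4, 2)
)

# Domain-specific words to drop, regardless of where they appear
custom_stop <- c("br", "film", "movie")

toy_filtered <- toy_words |>
  filter(!word %in% custom_stop)

toy_filtered
```

The same filter() call could be applied to movie_words_pos and movie_words_neg in place of the row indexing.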

Time for the word cloud! I begin by choosing a color palette. The brewer.pal() function comes from the RColorBrewer package, which is attached along with wordcloud.

pal <- brewer.pal(8,"Dark2")

And now for the plot. The code below generates word clouds for the two sets of reviews.

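# Remove plot margins, leaving room at the top for a title
par(mar = c(0, 0, 2, 0))

# Word cloud of the 50 most frequent words in positive reviews
movie_words_pos |> 
  with(wordcloud(word, n, random.order = FALSE, 
                 max.words = 50, colors = pal, scale = c(5, .5)))
title("Positive Movie Reviews", cex.main = 1.5)

# Word cloud of the 50 most frequent words in negative reviews
movie_words_neg |> 
  with(wordcloud(word, n, random.order = FALSE, 
                 max.words = 50, colors = pal, scale = c(5, .5)))
title("Negative Movie Reviews", cex.main = 1.5)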
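To compare the two clouds directly, one option is to draw them side by side with par(mfrow = c(1, 2)) and to call set.seed() first so the word placement is reproducible. This is only a sketch on made-up counts: toy_pos, toy_neg, the seed value, and the smaller scale are my own assumptions, not the tutorial's settings.

```r
library(wordcloud)  # also attaches RColorBrewer, which provides brewer.pal()

# Hypothetical word counts, invented for illustration
toy_pos <- data.frame(word = c("love", "great", "story", "time"),
                      n = c(40, 30, 20, 10))
toy_neg <- data.frame(word = c("bad", "boring", "story", "time"),
                      n = c(40, 30, 20, 10))

pal <- brewer.pal(8, "Dark2")

# Two panels in one row; keep the old settings so they can be restored
old_par <- par(mfrow = c(1, 2), mar = c(0, 0, 2, 0))

set.seed(123)  # wordcloud() places words randomly, so fix the seed
wordcloud(toy_pos$word, toy_pos$n, random.order = FALSE,
          colors = pal, scale = c(3, .5))
title("Positive")

wordcloud(toy_neg$word, toy_neg$n, random.order = FALSE,
          colors = pal, scale = c(3, .5))
title("Negative")

# Restore the previous graphics settings
par(old_par)
```

Swapping in movie_words_pos and movie_words_neg, with max.words = 50, would give the side-by-side version of the plots above.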

The two word clouds show some interesting differences. [Add your own interpretation here.]