Topic modeling Edgar Allan Poe’s writings

Let’s use topic modeling to differentiate the major works of Edgar Allan Poe. First, let’s bring in the necessary packages.

library(tidyverse)
library(gutenbergr)
library(stringr)
library(topicmodels)
library(tidytext)
library(dplyr)

Let’s begin by extracting four of Poe’s major works: The Masque of the Red Death, The Raven, The Cask of Amontillado, and The Fall of the House of Usher. The code below stores the Project Gutenberg ID numbers for those works in a vector and then downloads the texts from Project Gutenberg.

poe_nums <- c(932,1063,1064,1065)
books <- gutenberg_download(poe_nums,
                            meta_fields = "title")
## Determining mirror for Project Gutenberg from
## https://www.gutenberg.org/robot/harvest.
## Using mirror http://aleph.gutenberg.org.

Next we need to preprocess the books. This involves tokenizing the text (splitting it into one word per row), removing stop words (very common words that are not helpful when examining semantic content), counting how often each remaining word appears in each book, and finally casting those counts into a document-term matrix.

poe_words <- books %>%
  unnest_tokens(word, text)

word_counts <- poe_words %>%
  anti_join(stop_words) %>%
  count(title, word, sort = TRUE)
## Joining with `by = join_by(word)`
poe_dtm <- word_counts %>%
  cast_dtm(title, word, n)
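
To see what these steps do on a small scale, here is a minimal sketch on two made-up sentences. The toy tibble and its doc_a/doc_b labels are hypothetical and not part of the Poe data, and the sketch assumes the tm package is installed, since cast_dtm() builds a tm-style DocumentTermMatrix.

```r
library(dplyr)
library(tidytext)

# Two made-up "documents" (hypothetical toy data, not Poe's text)
toy <- tibble(
  title = c("doc_a", "doc_b"),
  text  = c("The raven sat upon the bust",
            "The cask of wine was in the vault")
)

toy_counts <- toy %>%
  unnest_tokens(word, text) %>%            # one lowercased word per row
  anti_join(stop_words, by = "word") %>%   # drop common function words
  count(title, word, sort = TRUE)          # word frequency per document

toy_dtm <- toy_counts %>%
  cast_dtm(title, word, n)                 # rows = documents, columns = terms

toy_counts
toy_dtm
```

Passing by = "word" to anti_join() is optional; it only silences the "Joining with" message that appears in the output above.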

Now we can estimate the latent Dirichlet allocation (LDA) model. Because we have only four books, we will ask for just two topics (k = 2). Setting a seed makes the result reproducible.

poe_lda <- LDA(poe_dtm, k = 2, control = list(seed = 1234))
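
With only four documents there is little basis for choosing k from the data. On a larger corpus, one common heuristic is to fit several values of k and compare perplexity on held-out documents, where lower is better. The sketch below is an illustration only: it uses the AssociatedPress example data that ships with topicmodels instead of the Poe books, and the document slices and the range of k values are arbitrary assumptions.

```r
library(topicmodels)

# Example document-term matrix shipped with topicmodels (not the Poe data)
data("AssociatedPress", package = "topicmodels")
train    <- AssociatedPress[1:20, ]    # small slice keeps the fits fast
held_out <- AssociatedPress[21:30, ]   # documents the models never saw

ks <- 2:4                              # candidate numbers of topics (arbitrary)
perp <- sapply(ks, function(k) {
  fit <- LDA(train, k = k, control = list(seed = 1234))
  perplexity(fit, newdata = held_out)  # lower means a better fit to new text
})

data.frame(k = ks, perplexity = perp)
```

Perplexity is only one guide. Whether the resulting topics are interpretable usually matters at least as much.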

Let’s examine the word weights (betas) associated with each topic. Each beta is the estimated probability of a word under a given topic.

poe_betas <- tidy(poe_lda, matrix = "beta")

top_terms <- poe_betas %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>% 
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

Now let’s check how each book loads on the topics. Each gamma is the estimated proportion of a book’s words that come from a given topic.

# Each document is a whole book, so the document name is already the title
poe_gammas <- tidy(poe_lda, matrix = "gamma") %>%
  rename(title = document)

poe_gammas %>%
  mutate(title = reorder(title, gamma * topic)) %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_col() +  # one gamma per book and topic, so bars are clearer than boxplots
  facet_wrap(~ title) +
  labs(x = "topic", y = expression(gamma))
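
As a sanity check on how to read these two matrices, the following minimal sketch fits the same kind of model to a tiny made-up corpus. The toy_counts data and the document labels d1 to d4 are hypothetical and not Poe’s text. It shows that betas sum to 1 within each topic, that gammas sum to 1 within each document, and how to pick the dominant topic for each document.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical word counts for four made-up documents (not Poe's text)
toy_counts <- tibble(
  title = rep(c("d1", "d2", "d3", "d4"), each = 3),
  word  = c("raven", "chamber", "night",
            "raven", "night",   "door",
            "cask",  "wine",    "vault",
            "wine",  "vault",   "torch"),
  n     = c(5L, 3L, 2L, 4L, 3L, 2L, 6L, 4L, 2L, 5L, 3L, 1L)
)

toy_lda <- toy_counts %>%
  cast_dtm(title, word, n) %>%
  LDA(k = 2, control = list(seed = 1234))

# Beta: one word distribution per topic, so each topic's betas sum to 1
beta_sums <- tidy(toy_lda, matrix = "beta") %>%
  group_by(topic) %>%
  summarise(total = sum(beta))

# Gamma: one topic mixture per document, so each document's gammas sum to 1
toy_gammas <- tidy(toy_lda, matrix = "gamma")
gamma_sums <- toy_gammas %>%
  group_by(document) %>%
  summarise(total = sum(gamma))

# Dominant topic: the topic with the largest gamma in each document
dominant <- toy_gammas %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

beta_sums
gamma_sums
dominant
```

The same slice_max() step applied to poe_gammas, grouped by title, would list the topic each book loads on most heavily.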

What can we tell about the differences between these groups? What explains why the algorithm decided to group them in this way?

(more to come…)