Statistics Sunday: Creating Wordclouds
This article is originally published at http://www.deeplytrivial.com/
library(quanteda) #install with install.packages("quanteda") if needed
data(data_corpus_inaugural)
speeches <- data_corpus_inaugural$documents
row.names(speeches) <- NULL
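Note that the `$documents` accessor above reflects the quanteda release that was current when this post was written. In quanteda 2.0 and later, a corpus is stored as a character vector with document variables attached, so that line may no longer return a data frame. Here is a minimal sketch of an equivalent, assuming quanteda 2.0 or later; the column name `texts` is chosen only to match the rest of this post:

```r
library(quanteda)

# Assumes quanteda >= 2.0, where a corpus is a character vector with docvars attached
speeches <- data.frame(
  texts = as.character(data_corpus_inaugural),  # one address per row
  docvars(data_corpus_inaugural),               # Year, President, and any other docvars
  stringsAsFactors = FALSE
)
row.names(speeches) <- NULL
```

Either way, the goal is a data frame with one row per speech and the speech text in a single column.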
This dataset has each Inaugural Address in a column called "texts," with year and President's name as additional variables. To analyze the words in the speeches and generate a wordcloud, we'll want to unnest the words in the texts column, so that each row holds a single word.
library(tidytext)
library(tidyverse)
speeches_tidy <- speeches %>%
  unnest_tokens(word, texts) %>%      # one lowercase word per row
  anti_join(stop_words, by = "word")  # drop common stop words
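To see what these two steps do, here is a small sketch on a made-up data frame. The `toy` data and its contents are hypothetical stand-ins for the speeches; only `unnest_tokens()`, `anti_join()`, and `stop_words` come from the packages loaded above.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-row data frame standing in for the speeches data
toy <- data.frame(
  President = c("A", "B"),
  texts = c("We the People of the nation", "The nation is strong"),
  stringsAsFactors = FALSE
)

toy_tidy <- toy %>%
  unnest_tokens(word, texts) %>%      # one lowercase word per row; President is kept
  anti_join(stop_words, by = "word")  # drop common words such as "the" and "of"

toy_tidy
```

The result is longer and narrower than the input: the texts column is gone, and each remaining row pairs a President with one non-stop word.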
For our first wordcloud, let's see which words are most common across all speeches.
library(wordcloud) #install.packages("wordcloud") if needed
speeches_tidy %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 50))
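Because wordcloud places words with some randomness, the picture can change from run to run. Here is a minimal sketch of a repeatable version, using made-up counts in a hypothetical `freq` data frame; `random.order` and `scale` are optional arguments of `wordcloud()`.

```r
library(wordcloud)

# Hypothetical word counts, used only to illustrate the arguments
freq <- data.frame(
  word = c("people", "nation", "world", "freedom", "peace"),
  n = c(40, 35, 30, 20, 10),
  stringsAsFactors = FALSE
)

set.seed(123)  # word placement is random, so fix the seed for a repeatable layout
with(freq, wordcloud(word, n,
                     max.words = 50,
                     random.order = FALSE,  # plot words in decreasing frequency
                     scale = c(3, 0.5)))    # largest and smallest text sizes
```

The same arguments can be added to the call on the full speeches data.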
We could very easily create a wordcloud for one President specifically. For instance, let's create one for Obama, since he provides us with two speeches' worth of words. But to take things up a notch, let's add sentiment information to our wordcloud. To do that, we'll use the comparison.cloud function from the wordcloud package; we'll also need the reshape2 package.
library(reshape2) #install.packages("reshape2") if needed
obama_words <- speeches_tidy %>%
  filter(President == "Obama") %>%
  count(word, sort = TRUE)
obama_words %>%
  inner_join(get_sentiments("nrc") %>%
               filter(sentiment %in% c("positive", "negative")),
             by = "word") %>%
  filter(n > 1) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "blue"))  # red = negative, blue = positive
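The acast step is what turns the long table of counts into the matrix that comparison.cloud expects. Here is a small sketch with made-up counts; the `toy_sentiment` data frame is hypothetical and stands in for the joined Obama words.

```r
library(reshape2)

# Hypothetical counts after joining words to a sentiment lexicon
toy_sentiment <- data.frame(
  word = c("hope", "peace", "crisis", "fear"),
  sentiment = c("positive", "positive", "negative", "negative"),
  n = c(5, 3, 4, 2),
  stringsAsFactors = FALSE
)

# Rows are words, columns are sentiments; missing combinations become 0
term_matrix <- acast(toy_sentiment, word ~ sentiment, value.var = "n", fill = 0)
term_matrix
```

The columns come out in alphabetical order, negative then positive, which is why colors = c("red", "blue") draws negative words in red and positive words in blue.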
Interestingly, the NRC lexicon classifies "government" and "words" as negative. But even if we ignore those two words, which are Obama's most frequent, the negatively-valenced words are much larger than most of his positively-valenced words. So while he uses many more distinct positively-valenced words than negatively-valenced words, as seen in the sheer number of blue words, he repeats individual negatively-valenced words more often. If you were so inclined, you could run a sentiment analysis on his speeches and see whether they tend to be more positive or negative, and whether they follow arcs of negativity and positivity. And feel free to generate your own wordcloud: all you'd need to do is change filter(President == "Obama") to whatever President you're interested in examining (or swap in whatever text data you'd like to use, if Presidents' speeches aren't your thing).
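As a starting point for that kind of sentiment analysis, here is a minimal sketch that computes a net sentiment score per speech. Everything in it is illustrative: `toy_words` and `toy_lexicon` are hypothetical stand-ins, and it assumes the speeches data carries a `Year` column. With the real data you would use speeches_tidy and the positive and negative rows of get_sentiments("nrc") instead.

```r
library(dplyr)

# Hypothetical tidy words (one row per word) and a tiny stand-in lexicon
toy_words <- data.frame(
  Year = c(2009, 2009, 2009, 2013, 2013),
  word = c("hope", "crisis", "fear", "peace", "hope"),
  stringsAsFactors = FALSE
)
toy_lexicon <- data.frame(
  word = c("hope", "peace", "crisis", "fear"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

net_sentiment <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  group_by(Year) %>%
  summarise(
    positive = sum(sentiment == "positive"),
    negative = sum(sentiment == "negative"),
    net = positive - negative  # above 0 means more positive than negative words
  )

net_sentiment
```

A simple count like this treats every word equally and ignores context, so it is best read as a rough first look rather than a verdict on a speech.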