Statistics Sunday: Creating Wordclouds

by Sara · June 10, 2018

This article was originally published at http://www.deeplytrivial.com/

Cloudy with a Chance of Words

Lots of fun projects in the works, so today's post will be short - a demonstration of how to create wordclouds, both with and without sentiment analysis results. While I could use song lyrics again, I decided to use a different dataset that comes with the quanteda package: all 58 Inaugural Addresses, from Washington's first speech in 1789 to Trump's in 2017.

library(quanteda) #install with install.packages("quanteda") if needed

data(data_corpus_inaugural)
speeches <- data_corpus_inaugural$documents
row.names(speeches) <- NULL
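A note for newer installations: the $documents accessor above reflects the corpus structure of quanteda 1.x. In later releases (2.0 onward, as far as I'm aware) the corpus is stored differently and that accessor no longer returns the data frame. The sketch below assumes a newer quanteda in which convert() accepts a corpus with to = "data.frame"; the rename to "texts" is only there so the rest of the code in this post runs unchanged.

```r
library(quanteda)

# Assumption: quanteda >= 2.0, where convert() works on corpus objects
speeches <- convert(data_corpus_inaugural, to = "data.frame")

# convert() calls the text column "text"; rename it to match this post
names(speeches)[names(speeches) == "text"] <- "texts"

# Quick look at the document-level variables
head(speeches[, c("Year", "President")])
```

Note that newer versions of the dataset may also contain addresses given after 2017, so the count of speeches can differ from the 58 mentioned above.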

As you can see, this dataset has each Inaugural Address in a column called "texts," with year and President's name as additional variables. To analyze the words in the speeches, and generate a wordcloud, we'll want to unnest the words in the texts column.

library(tidytext)
library(tidyverse)

speeches_tidy <- speeches %>%
  unnest_tokens(word, texts) %>%
  anti_join(stop_words)

## Joining, by = "word"
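To make the unnesting step concrete, here is a small sketch on a made-up two-row data frame (the toy object and its sentences are invented for illustration, not taken from the corpus). unnest_tokens() lowercases the text, strips punctuation, and returns one row per word, keeping the other columns; the anti_join() then drops common words such as "the" and "of".

```r
library(dplyr)
library(tidytext)

# Hypothetical data, shaped like the speeches data frame
toy <- data.frame(
  President = c("A", "B"),
  texts = c("The people and the government.",
            "Peace is a duty of the president."),
  stringsAsFactors = FALSE
)

toy_tidy <- toy %>%
  unnest_tokens(word, texts) %>%      # one lowercase word per row
  anti_join(stop_words, by = "word")  # remove stop words

toy_tidy
```

Passing by = "word" explicitly does the same join as above, just without the "Joining, by" message.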

For our first wordcloud, let's see what the most common words are across all speeches.

library(wordcloud) #install.packages("wordcloud") if needed

speeches_tidy %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 50))

While the language used by Presidents certainly varies by time period and the national situation, these speeches refer often to the people and the government; in fact, most of the larger words directly reference the United States and Americans. The speeches address the role of "president" and likely the "duty" that role entails. The word "peace" is only slightly larger than "war," and one could probably map out which speeches were given during wartime and which weren't.

We could very easily create a wordcloud for one President specifically. For instance, let's create one for Obama, since he provides us with two speeches' worth of words. But to take things up a notch, let's add sentiment information to our wordcloud. To do that, we'll use the comparison.cloud function from the wordcloud package; we'll also need the reshape2 package.

library(reshape2) #install.packages("reshape2") if needed

obama_words <- speeches_tidy %>%
  filter(President == "Obama") %>%
  count(word, sort = TRUE)

obama_words %>%
  inner_join(get_sentiments("nrc") %>%
               filter(sentiment %in% c("positive",
                                       "negative"))) %>%
  filter(n > 1) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red","blue"))

## Joining, by = "word"

The acast statement reshapes the data, putting our sentiments of positive and negative as separate columns. Setting fill = 0 is important, since a negative word will be missing a value for the positive column and vice versa; without fill = 0, it would drop any row with NA in one of the columns (which would be every word in the set). As a sidenote, we could use the comparison cloud to compare words across two documents, such as comparing two Presidents. The columns would be counts for each President, as opposed to count by sentiment.
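The reshaping step is easier to see on a tiny example. The sketch below uses invented words and counts (toy_counts is hypothetical, not output from the speeches) and runs acast twice, once without fill and once with fill = 0, so you can compare the resulting matrices.

```r
library(reshape2)

# Hypothetical word counts, one row per word-sentiment pair
toy_counts <- data.frame(
  word = c("hope", "freedom", "crisis", "fear"),
  sentiment = c("positive", "positive", "negative", "negative"),
  n = c(5, 3, 4, 2),
  stringsAsFactors = FALSE
)

# Without fill, the missing word-sentiment combinations come back as NA
with_na <- acast(toy_counts, word ~ sentiment, value.var = "n")

# With fill = 0, every cell holds a usable count
filled <- acast(toy_counts, word ~ sentiment, value.var = "n", fill = 0)

with_na
filled
```

Both results are matrices with words as row names and one column per sentiment. Swapping the sentiment column for a President column in the formula would give the two-document layout mentioned above.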

Interestingly, the NRC lexicon classifies "government" and "words" as negative. But even if we ignore those two words, which are Obama's most frequent, the negatively-valenced words are much larger than most of his positively-valenced words. So while he uses many more distinct positively-valenced words than negatively-valenced ones - seen in the sheer number of blue words - he uses the individual negatively-valenced words more often. If you were so inclined, you could probably run a sentiment analysis on his speeches and see if they tend to be more positive or negative, and/or if they follow arcs of negativity and positivity. And feel free to generate your own wordcloud: all you'd need to do is change the name in filter(President == "Obama") to whatever President you're interested in examining (or swap in whatever text data you'd like to use, if Presidents' speeches aren't your thing).
