Statistics Sunday: Using Text Analysis to Become a Better Writer
This article is originally published at http://www.deeplytrivial.com/
I'm sure we all have our own words we use way too often.
Text analysis can also reveal patterns in writing, and for a writer, it can help show when we depend too much on certain words and phrases. For today's demonstration, I read in my (still in-progress) novel - a murder mystery called Killing Mr. Johnson - and did the same type of text analysis I've been demonstrating in recent posts.
To make things easier, I copied the document into a text file, and used the read_lines and tibble functions to prepare data for my analysis.
setwd("~/Dropbox/Writing/Killing Mr. Johnson")
library(tidyverse)
KMJ_text <- read_lines('KMJ_full.txt')
KMJ <- tibble(KMJ_text) %>%
  mutate(linenumber = row_number())
I kept my line numbers, which I could use in some future analysis. For now, I'm going to tokenize my data, drop stop words, and examine my most frequently used words.
library(tidytext)
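# Split each line into one lowercase word per row, then drop common stop words
KMJ_words <- KMJ %>%
  unnest_tokens(word, KMJ_text) %>%
  anti_join(stop_words, by = "word")
# Plot every word that appears more than 75 times, most frequent at the top
KMJ_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 75) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() + xlab(NULL) + coord_flip()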
Fortunately, my top 5 words are the names of the 5 main characters, with the star character at number 1: Emily is named almost 600 times in the book. It's a murder mystery, so I'm not too surprised that words like "body" and "death" are also common. But I know that, in my fiction writing, I often depend on a word type that draws a lot of disdain from authors I admire: adverbs. Not all adverbs, mind you, but specifically (pun intended) the "-ly adverbs."
ly_words <- KMJ_words %>%
  filter(str_detect(word, ".ly")) %>%
  count(word, sort = TRUE)
head(ly_words)
## # A tibble: 6 x 2
##   word         n
##   <chr>    <int>
## 1 emily      599
## 2 finally     80
## 3 quickly     60
## 4 emily’s     53
## 5 suddenly    39
## 6 quietly     38
Since my main character is named Emily, she was accidentally picked up by my string detect function, which flags any word containing a character followed by "ly". A few other top words in the list aren't actually -ly adverbs either. I'll filter those out, then take a look at what I have left.
filter_out <- c("emily", "emily's", "emily’s","family", "reply", "holy")
ly_words <- ly_words %>%
  filter(!word %in% filter_out)
ly_words %>%
  filter(n > 10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() + xlab(NULL) + coord_flip()
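A side note on the pattern used above: ".ly" is unanchored, so it matches "ly" anywhere after the first character of a word, not only at the end. The sketch below compares it with an end-anchored pattern, "ly$", on a handful of made-up tokens (sample_words is a hypothetical vector for illustration, not data from the novel). It is only meant to show the difference between the two patterns; it was not part of the original analysis, and swapping it in would change the counts shown earlier.

```r
library(stringr)

# Hypothetical tokens, chosen only to illustrate the two patterns
sample_words <- c("finally", "emily", "emily's", "flying",
                  "quickly", "family", "only")

# Unanchored pattern from the post: any character followed by "ly", anywhere
unanchored <- str_detect(sample_words, ".ly")

# Anchored pattern: the word has to end in "ly"
anchored <- str_detect(sample_words, "ly$")

# Words that only the unanchored pattern picks up
sample_words[unanchored & !anchored]
```

Anchoring drops possessives and mid-word matches such as "flying", but names and nouns like "emily" and "family" still end in "ly", so a manual exclusion list like filter_out would remain necessary.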
I use "finally", "quickly", and "suddenly" far too often. "Quietly" is also up there. I think the reason so many writers hate on adverbs is that they can encourage lazy writing. You might write that someone said something quietly or softly, but is there a better word? Did they whisper? Mutter? Murmur? Hiss? Did someone "move quickly" or did they do something else - run, sprint, dash?
At the same time, sometimes adverbs are necessary. I mean, can I think of a complete sentence that only includes an adverb? Definitely. Still, it might become tedious if I keep leaning on the same words, and when a fiction book (or really any kind of writing) is tedious, we often give up. These results give me some things to think about as I edit.
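Because the line numbers were kept alongside the text, one possible way to act on these results while editing is to look up which lines contain a given adverb. The following is a minimal sketch on a three-line stand-in for the manuscript; the toy tibble, the overused vector, and the adverb_lines name are all hypothetical and are not taken from the novel or the original analysis.

```r
library(dplyr)
library(tidytext)

# Hypothetical three-line stand-in for the manuscript text
toy <- tibble(text = c("Emily quickly closed the door.",
                       "She waited.",
                       "Suddenly the phone rang, and she quickly answered.")) %>%
  mutate(linenumber = row_number())

# Hypothetical list of adverbs to track down
overused <- c("quickly", "suddenly")

# unnest_tokens() lowercases the words and keeps the linenumber column,
# so each remaining row points back to a line in the text
adverb_lines <- toy %>%
  unnest_tokens(word, text) %>%
  filter(word %in% overused) %>%
  select(linenumber, word)

adverb_lines
```

With the full text, the same steps applied to KMJ and its KMJ_text column should give a list of lines to revisit, assuming the line breaks in the text file correspond to something useful, such as paragraphs.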
I still have some big plans on the horizon, including some new statistics videos, a redesigned blog, and more surprises later! Thanks for reading!