N is for n_distinct
This article is originally published at http://www.deeplytrivial.com/
Today, we'll start digging into some of the functions used to summarise data. The full summarise function will be covered for the letter S. For now, let's look at one function from the tidyverse that can give some overall information about a dataset: n_distinct.
This function counts the number of unique values in a vector or variable. There are 87 books in my 2019 reading list, but I read multiple books by the same author(s). Let's see how many authors there are in my set.
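library(tidyverse)
library(magrittr)
reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv", col_names = TRUE)
# %$% is magrittr's exposition pipe: it makes the columns of reads2019
# available by name, so Author can be passed to n_distinct directly
reads2019 %$% n_distinct(Author)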
## [1] 42
# Count books per author and keep the 10 authors with the most books
reads2019 %>%
  group_by(Author) %>%
  summarise(Books = n()) %>%
  arrange(desc(Books), Author) %>%
  filter(between(row_number(), 1, 10))
## # A tibble: 10 x 2
##    Author           Books
##    <chr>            <int>
##  1 Baum, L. Frank      14
##  2 Pratchett, Terry    13
##  3 King, Stephen        6
##  4 Scalzi, John         6
##  5 Abbott, Mildred      5
##  6 Atwood, Margaret     5
##  7 Patchett, Ann        2
##  8 Ware, Ruth           2
##  9 Adams, Douglas       1
## 10 Adeyemi, Tomi        1
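If you don't have my reading list handy, here's a minimal sketch of what n_distinct is doing, using a made-up vector (the names and the NA are just for illustration, not from my data). It should give the same count as base R's length(unique()), and it treats NA as its own value unless you ask it not to.

```r
library(dplyr)

# Hypothetical toy vector, not taken from the reading list
authors <- c("Pratchett", "King", "Pratchett", NA, "Atwood")

# n_distinct() counts unique values, like length(unique()) in base R
n_unique <- n_distinct(authors)
same_as_base <- n_unique == length(unique(authors))

# By default NA counts as a distinct value; na.rm = TRUE drops it first
n_unique_no_na <- n_distinct(authors, na.rm = TRUE)

print(c(n_unique, n_unique_no_na))
## [1] 4 3
```

The na.rm argument is worth remembering whenever a column has missing values, since otherwise the count comes out one higher than you might expect.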
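n_distinct can also be used in conjunction with other functions, like filter or group_by. For example, we can use tidytext to split each book title into individual words, then count the unique and total words in each title.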
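library(tidytext)
# Split each title into one word per row; unnest_tokens drops the Title
# column, so join back to reads2019 to recover it
titlewords <- reads2019 %>%
  unnest_tokens(titleword, Title) %>%
  select(titleword, Author, Book.ID) %>%
  left_join(reads2019, by = c("Book.ID", "Author"))
# Unique and total word counts for each title
titlewords %>%
  group_by(Title) %>%
  summarise(unique_words = n_distinct(titleword),
            total_words = n())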
## # A tibble: 87 x 3
##    Title                                                unique_words total_words
##    <chr>                                                       <int>       <int>
##  1 1Q84                                                            1           1
##  2 A Disorder Peculiar to the Country                              6           6
##  3 Alas, Babylon                                                   2           2
##  4 Artemis                                                         1           1
##  5 Bird Box (Bird Box, #1)                                         3           5
##  6 Boundaries: When to Say Yes, How to Say No to Take …           12          15
##  7 Cell                                                            1           1
##  8 Children of Virtue and Vengeance (Legacy of Orïsha,…            8           9
##  9 Cujo                                                            1           1
## 10 Dirk Gently's Holistic Detective Agency (Dirk Gentl…            7           8
## # … with 77 more rows
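# Unique words across all titles
titlewords %$%
  n_distinct(titleword)
## [1] 224
# Drop common stop words (tidytext's stop_words dataset) and count again
titlewords <- titlewords %>%
  anti_join(stop_words, by = c("titleword" = "word"))
titlewords %$%
  n_distinct(titleword)
## [1] 181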
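Since filter was mentioned above but not shown, here's a rough sketch of combining it with n_distinct. The toy_reads tibble and its Rating column are made up for illustration; I'm only assuming a dataset with one row per book.

```r
library(dplyr)

# Hypothetical toy data: one row per book, values invented for this example
toy_reads <- tibble(
  Author = c("A", "A", "B", "C", "C", "C"),
  Rating = c(5, 3, 4, 5, 5, 2)
)

# filter + n_distinct: how many different authors have a 5-star book?
five_star_authors <- toy_reads %>%
  filter(Rating == 5) %>%
  summarise(authors = n_distinct(Author)) %>%
  pull(authors)

# group_by + n_distinct: how many different ratings did each author get?
ratings_per_author <- toy_reads %>%
  group_by(Author) %>%
  summarise(distinct_ratings = n_distinct(Rating))

print(five_star_authors)
print(ratings_per_author)
```

Filtering first and then counting answers a "how many different X meet this condition" question, while grouping first gives one count per group.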