# Graph analysis using the tidyverse

This article is originally published at https://rviews.rstudio.com/index.xml

It is because I am not a graph analysis expert that I though it important to write this article. For someone who thinks in terms of single rectangular data sets, it is a bit of a mental leap to understand how to apply *tidy* principles to a more robust object, such as a graph table. Thankfully, there are two packages that make this work much easier:

`tidygraph`

- Provides a way for`dplyr`

to interact with graphs`ggraph`

- Extension to`ggplot2`

for graph analysis

### Quick intro

Simply put, graph theory studies relationships between objects in a group. Visually, we can think of a graph as a series of interconnected circles, each representing a member of a group, such as people in a Social Network. Lines drawn between the circles represent a relationship between the members, such as friendships in a Social Network. Graph analysis helps with figuring out things such as the influence of a certain member, or how many friends are in between two members. A more formal definition and detailed explanation of Graph Theory can be found in Wikipedia here.

## Example

Using an example, this article will introduce concepts of graph analysis work, and how `tidyverse`

and `tidyverse`

-adjacent tools can be used for such analysis.

### Data source

The tidytuesday weekly project encourages new and experienced users to use the `tidyverse`

tools to analyze data sets that change every week. I have been using that opportunity to lean new tools and techniques. One of the most recent data sets relates to French trains; it contains aggregate daily total trips per connecting stations.

```
library(readr)
url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-26/small_trains.csv"
small_trains <- read_csv(url)
```

`head(small_trains)`

```
## # A tibble: 6 x 13
## year month service departure_stati… arrival_station journey_time_avg
## <int> <int> <chr> <chr> <chr> <dbl>
## 1 2017 9 Nation… PARIS EST METZ 85.1
## 2 2017 9 Nation… REIMS PARIS EST 47.1
## 3 2017 9 Nation… PARIS EST STRASBOURG 116.
## 4 2017 9 Nation… PARIS LYON AVIGNON TGV 161.
## 5 2017 9 Nation… PARIS LYON BELLEGARDE (AI… 164.
## 6 2017 9 Nation… PARIS LYON BESANCON FRANC… 129.
## # … with 7 more variables: total_num_trips <int>,
## # avg_delay_all_departing <dbl>, avg_delay_all_arriving <dbl>,
## # num_late_at_departure <int>, num_arriving_late <int>,
## # delay_cause <chr>, delayed_number <dbl>
```

### Data Preparation

Even though it was meant to analyze delays, I thought it would be interesting to use the data to understand how stations connect with each other. A new summarized data set is created, called *routes*, which contains a single entry for each connected station. It also includes the average journey time it takes to go between stations.

```
library(dplyr)
routes <- small_trains %>%
group_by(departure_station, arrival_station) %>%
summarise(journey_time = mean(journey_time_avg)) %>%
ungroup() %>%
mutate(from = departure_station,
to = arrival_station) %>%
select(from, to, journey_time)
routes
```

```
## # A tibble: 130 x 3
## from to journey_time
## <chr> <chr> <dbl>
## 1 AIX EN PROVENCE TGV PARIS LYON 186.
## 2 ANGERS SAINT LAUD PARIS MONTPARNASSE 97.5
## 3 ANGOULEME PARIS MONTPARNASSE 146.
## 4 ANNECY PARIS LYON 225.
## 5 ARRAS PARIS NORD 52.8
## 6 AVIGNON TGV PARIS LYON 161.
## 7 BARCELONA PARIS LYON 358.
## 8 BELLEGARDE (AIN) PARIS LYON 163.
## 9 BESANCON FRANCHE COMTE TGV PARIS LYON 131.
## 10 BORDEAUX ST JEAN PARIS MONTPARNASSE 186.
## # … with 120 more rows
```

The next step is to transform the tidy data set, into a graph table. In order to prepare *routes* for this transformation, it has to contain two variables specifically named: *from* and *to*, which are the names that `tidygraph`

expects to see. Those variables should contain the name of each member (e.g., “AIX EN PROVENCE TGV”), and the relationship (“AIX EN PROVENCE TGV” -> “PARIS LYON”) .

In graph terminology, a member of the group is called a **node** (or vertex) in the graph, and a relationship between nodes is called an **edge**.

```
library(tidygraph)
graph_routes <- as_tbl_graph(routes)
graph_routes
```

```
## # A tbl_graph: 59 nodes and 130 edges
## #
## # A directed simple graph with 1 component
## #
## # Node Data: 59 x 1 (active)
## name
## <chr>
## 1 AIX EN PROVENCE TGV
## 2 ANGERS SAINT LAUD
## 3 ANGOULEME
## 4 ANNECY
## 5 ARRAS
## 6 AVIGNON TGV
## # … with 53 more rows
## #
## # Edge Data: 130 x 3
## from to journey_time
## <int> <int> <dbl>
## 1 1 39 186.
## 2 2 40 97.5
## 3 3 40 146.
## # … with 127 more rows
```

The `as_tbl_graph()`

function splits the *routes* table into two:

Node Data - Contains all of the unique values found in the

*from*and*to*variables. In this case, it is a table with a single column containing the names of all of the stations.Edge Data - Is a table of all relationships between

*from*and*to*. A peculiarity of`tidygraph`

is that it uses the row position of the node as the identifier for*from*and*to*, instead of its original name.

Another interesting thing about `tidygraph`

is that it allows us to attach more information about the node or edge in an additional column. In this case, *journey_time* is not really needed to create the graph table, but it may be needed for the analysis we plan to perform. The `as_tbl_graph()`

function automatically created the column for us.

Thinking about *graph_routes* as two `tibbles`

inside a larger table graph, was one of the two major mental breakthroughs I had during this exercise. At that point, it became evident that `dplyr`

needs a way to know which of the two tables (nodes or edges) to perform the transformations on. In `tidygraph`

, this is done using the `activate()`

function. To showcase this, the nodes table will be “activated” in order to add two new string variables derived from *name*.

```
library(stringr)
graph_routes <- graph_routes %>%
activate(nodes) %>%
mutate(
title = str_to_title(name),
label = str_replace_all(title, " ", "\n")
)
graph_routes
```

```
## # A tbl_graph: 59 nodes and 130 edges
## #
## # A directed simple graph with 1 component
## #
## # Node Data: 59 x 3 (active)
## name title label
## <chr> <chr> <chr>
## 1 AIX EN PROVENCE TGV Aix En Provence Tgv "Aix\nEn\nProvence\nTgv"
## 2 ANGERS SAINT LAUD Angers Saint Laud "Angers\nSaint\nLaud"
## 3 ANGOULEME Angouleme Angouleme
## 4 ANNECY Annecy Annecy
## 5 ARRAS Arras Arras
## 6 AVIGNON TGV Avignon Tgv "Avignon\nTgv"
## # … with 53 more rows
## #
## # Edge Data: 130 x 3
## from to journey_time
## <int> <int> <dbl>
## 1 1 39 186.
## 2 2 40 97.5
## 3 3 40 146.
## # … with 127 more rows
```

It was really impressive how easy it was to manipulate the graph table, because once one of the two tables are activated, all of the changes can be made using `tidyverse`

tools. The same approach can be used to extract data from the graph table. In this case, a list of all the stations is pulled into a single character vector.

```
stations <- graph_routes %>%
activate(nodes) %>%
pull(title)
stations
```

```
## [1] "Aix En Provence Tgv" "Angers Saint Laud"
## [3] "Angouleme" "Annecy"
## [5] "Arras" "Avignon Tgv"
## [7] "Barcelona" "Bellegarde (Ain)"
## [9] "Besancon Franche Comte Tgv" "Bordeaux St Jean"
## [11] "Brest" "Chambery Challes Les Eaux"
## [13] "Dijon Ville" "Douai"
## [15] "Dunkerque" "Francfort"
## [17] "Geneve" "Grenoble"
## [19] "Italie" "La Rochelle Ville"
## [21] "Lausanne" "Laval"
## [23] "Le Creusot Montceau Montchanin" "Le Mans"
## [25] "Lille" "Lyon Part Dieu"
## [27] "Macon Loche" "Madrid"
## [29] "Marne La Vallee" "Marseille St Charles"
## [31] "Metz" "Montpellier"
## [33] "Mulhouse Ville" "Nancy"
## [35] "Nantes" "Nice Ville"
## [37] "Nimes" "Paris Est"
## [39] "Paris Lyon" "Paris Montparnasse"
## [41] "Paris Nord" "Paris Vaugirard"
## [43] "Perpignan" "Poitiers"
## [45] "Quimper" "Reims"
## [47] "Rennes" "Saint Etienne Chateaucreux"
## [49] "St Malo" "St Pierre Des Corps"
## [51] "Strasbourg" "Stuttgart"
## [53] "Toulon" "Toulouse Matabiau"
## [55] "Tourcoing" "Tours"
## [57] "Valence Alixan Tgv" "Vannes"
## [59] "Zurich"
```

## Visualizing

In graphs, the absolute position of the each node is not as relevant as it is with other kinds of visualizations. A very minimal `ggplot2`

theme is set to make it easier to view the plotted graph.

```
library(ggplot2)
thm <- theme_minimal() +
theme(
legend.position = "none",
axis.title = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank(),
panel.grid.major = element_blank(),
)
theme_set(thm)
```

To create the plot, start with `ggraph()`

instead of `ggplot2()`

. The `ggraph`

package contains `geoms`

that are unique to graph analysis. The package contains `geoms`

to specifically plot nodes, and other `geoms`

for edges.

As a first basic test, the *point* `geom`

will be used, but instead of calling`geom_point()`

, we call `geom_node_point()`

. The edges are plotted using `geom_edge_diagonal()`

.

```
library(ggraph)
graph_routes %>%
ggraph(layout = "kk") +
geom_node_point() +
geom_edge_diagonal()
```

To make it easier to see where each station is placed in this plot, the `geom_node_text()`

is used. Just as with regular `geoms`

in `ggplot2`

, other attributes such as `size`

, `color`

, and `alpha`

can be modified.

```
graph_routes %>%
ggraph(layout = "kk") +
geom_node_text(aes(label = label, color = name), size = 3) +
geom_edge_diagonal(color = "gray", alpha = 0.4)
```

## Morphing time!

The second mental leap was understanding how a graph algorithm is applied. Typically, the output of a model function is a model object, not a data object. With `tidygraph`

, the process begins and ends with a graph table. The steps are these:

- Start with a graph table
- Temporarily transform the graph to comply with the model that is requested (
`morph()`

) - Add additional transformations to the morphed data using
`dplyr`

(optional) - Restore the original graph table, but modified to keep the changes made during the morph

The shortest path algorithm defines the “length” as the number of edges in between two nodes. There may be multiple routes to get from point A to point B, but the algorithm chooses the one with the fewest number of “hops”. The way to call the algorithm is inside the `morph()`

function. Even though `to_shortest_path()`

is a function in itself, and it is possible run it without `morph()`

, it is not meant to be used that way. In the example, the *journey_time* is used as `weights`

to help the algorithm find an optimal route between the *Arras* and the *Nancy* stations. The print output of the morphed graph will not be like the original graph table.

```
from <- which(stations == "Arras")
to <- which(stations == "Nancy")
shortest <- graph_routes %>%
morph(to_shortest_path, from, to, weights = journey_time)
shortest
```

```
## # A tbl_graph temporarily morphed to a shortest path representation
## #
## # Original graph is a directed simple graph with 1 component
## # consisting of 59 nodes and 130 edges
```

It is possible to make more transformations with the use of `activate()`

and `dplyr`

functions. The results can be previewed, or committed back to the original R variable using `unmorph()`

. By default, nodes are active in a morphed graph, so there is no need to set that explicitly.

```
shortest %>%
mutate(selected_node = TRUE) %>%
unmorph()
```

```
## # A tbl_graph: 59 nodes and 130 edges
## #
## # A directed simple graph with 1 component
## #
## # Node Data: 59 x 4 (active)
## name title label selected_node
## <chr> <chr> <chr> <lgl>
## 1 AIX EN PROVENCE T… Aix En Provence T… "Aix\nEn\nProvence\n… NA
## 2 ANGERS SAINT LAUD Angers Saint Laud "Angers\nSaint\nLaud" NA
## 3 ANGOULEME Angouleme Angouleme NA
## 4 ANNECY Annecy Annecy NA
## 5 ARRAS Arras Arras TRUE
## 6 AVIGNON TGV Avignon Tgv "Avignon\nTgv" NA
## # … with 53 more rows
## #
## # Edge Data: 130 x 3
## from to journey_time
## <int> <int> <dbl>
## 1 1 39 186.
## 2 2 40 97.5
## 3 3 40 146.
## # … with 127 more rows
```

While it was morphed, only the few nodes that make up the connections between the Arras and Nancy stations were selected. A simple `mutate()`

adds a new variable called *selected_node*, which tags those nodes with TRUE. The new variable and value is retained once the rest of the nodes are restored via the `unmorph()`

command.

To keep the change, the *shortest* variable is updated with the changes made to both edges and nodes.

```
shortest <- shortest %>%
mutate(selected_node = TRUE) %>%
activate(edges) %>%
mutate(selected_edge = TRUE) %>%
unmorph()
```

The next step is to coerce each NA into a 1, and the shortest route into a 2. This will allow us to easily re-arrange the order that the edges are drawn in the plot, ensuring that the route will be drawn at the top.

```
shortest <- shortest %>%
activate(nodes) %>%
mutate(selected_node = ifelse(is.na(selected_node), 1, 2)) %>%
activate(edges) %>%
mutate(selected_edge = ifelse(is.na(selected_edge), 1, 2)) %>%
arrange(selected_edge)
shortest
```

```
## # A tbl_graph: 59 nodes and 130 edges
## #
## # A directed simple graph with 1 component
## #
## # Edge Data: 130 x 4 (active)
## from to journey_time selected_edge
## <int> <int> <dbl> <dbl>
## 1 1 39 186. 1
## 2 2 40 97.5 1
## 3 3 40 146. 1
## 4 4 39 225. 1
## 5 6 39 161. 1
## 6 7 39 358. 1
## # … with 124 more rows
## #
## # Node Data: 59 x 4
## name title label selected_node
## <chr> <chr> <chr> <dbl>
## 1 AIX EN PROVENCE T… Aix En Provence T… "Aix\nEn\nProvence\n… 1
## 2 ANGERS SAINT LAUD Angers Saint Laud "Angers\nSaint\nLaud" 1
## 3 ANGOULEME Angouleme Angouleme 1
## # … with 56 more rows
```

A simple way to plot the route is to use the *selected_* variables to modify the `alpha`

. This will highlight the shortest path, without completely removing the other stations. This is a personal design choice, so experimenting with different ways of highlighting the results is always recommended.

```
shortest %>%
ggraph(layout = "kk") +
geom_edge_diagonal(aes(alpha = selected_edge), color = "gray") +
geom_node_text(aes(label = label, color =name, alpha = selected_node ), size = 3)
```

The *selected_* fields can also be used in other `dplyr`

functions to analyze the results. For example, to know the aggregate information about the trip, *selected_edge* is used to filter the edges, and then the totals can be calculated. There is no `summarise()`

function for graph tables; this make sense because the graph table would become a summarized table with such a function. Since the end result we seek is a total rather than another graph table, a simple `as_tibble()`

command will coerce the edges, which will then allows us to finish the calculation.

```
shortest %>%
activate(edges) %>%
filter(selected_edge == 2) %>%
as_tibble() %>%
summarise(
total_stops = n() - 1,
total_time = round(sum(journey_time) / 60)
)
```

```
## # A tibble: 1 x 2
## total_stops total_time
## <dbl> <dbl>
## 1 8 23
```

## Re-using the code

To compile most of the code in a single chunk, here is an example of how to re-run the shortest path for a different set of stations: the Laval and Montpellier stations.

```
from <- which(stations == "Montpellier")
to <- which(stations == "Laval")
shortest <- graph_routes %>%
morph(to_shortest_path, from, to, weights = journey_time) %>%
mutate(selected_node = TRUE) %>%
activate(edges) %>%
mutate(selected_edge = TRUE) %>%
unmorph() %>%
activate(nodes) %>%
mutate(selected_node = ifelse(is.na(selected_node), 1, 2)) %>%
activate(edges) %>%
mutate(selected_edge = ifelse(is.na(selected_edge), 1, 2)) %>%
arrange(selected_edge)
shortest %>%
ggraph(layout = "kk") +
geom_edge_diagonal(aes(alpha = selected_edge), color = "gray") +
geom_node_text(aes(label = label, color =name, alpha = selected_node ), size = 3)
```

Additional, the same code can be recycled to obtain the trip summarized data.

```
shortest %>%
activate(edges) %>%
filter(selected_edge == 2) %>%
as_tibble() %>%
summarise(
total_stops = n() - 1,
total_time = round(sum(journey_time) / 60)
)
```

```
## # A tibble: 1 x 2
## total_stops total_time
## <dbl> <dbl>
## 1 3 10
```

## Shiny app

To see how to use this kind of analysis inside Shiny, please refer to this application. It lets the user select two stations, and it returns the route, plus the summarized data. The source code is embedded in the app.

Thanks for visiting r-craft.org

This article is originally published at https://rviews.rstudio.com/index.xml

Please visit source website for post related comments.