How to use jailbreakr
This article is originally published at http://www.brodrigues.co/
What is jailbreakr
The jailbreakr
package is probably one of the most interesting packages I came across recently.
This package makes it possible to extract messy data from spreadsheets. What is meant by messy? I
am sure you already had to deal with spreadsheets that contained little tables inside a single
sheet for example. As far as I know, there is no simple way of extracting these tables without having
to fiddle around a lot. This is now over with jailbreakr
. Well not entirely, because jailbreakr
is still in development, but it works well already. If you want to know more about the planned
features, you can watch the
following
video by
Jenny Bryan, one of the package’s authors.
Installation and data
You will have to install the package from Github, as it is not on CRAN yet. Here is the Github link. To install the package, just run the following commands in an R console:
devtools::install_github(c("hadley/xml2",
"rsheets/linen",
"rsheets/cellranger",
"rsheets/rexcel",
"rsheets/jailbreakr"))
If you get the following error:
devtools::install_github("hadley/xml2")
Downloading GitHub repo hadley/xml2@master
from URL https://api.github.com/repos/hadley/xml2/zipball/master
Error in system(full, intern = quiet, ignore.stderr = quiet, ...) :
error in running command
and if you’re on a GNU+Linux distribution try to run the following command:
options(unzip = "internal")
and then run github_install()
again.
As you can see, you need some other packages to make it work. Now we are going to get some data. We
are going to download some time series from the European Commission, data I had to deal with
recently. Download the data by clicking here
and look for the spreadsheet titled Investment_total_factors_nace2.xlsx
. The data we are interested
in is on the second sheet, named TOT
. You cannot import this sheet easily into R because there
are four tables on the same sheet. Let us use jailbreakr
to get these tables out of the sheet and
into nice, tidy, data frames.
jailbreakr
to the rescue
The first step is to read the data in. For this, we are going to use the rexcel
package, which is
also part of the rsheets
organization on Github that was set up by Jenny Brian and Rich Fitzjohn,
the authors of these packages. rexcel
imports the sheet you want but not in a way that is
immediately useful to you. It just gets the sheet into R, which makes it then possible to use
jailbreakr
’s magic on it. First, let’s import the packages we need:
library("rexcel")
library("jailbreakr")
We need to check which sheet to import. There are two sheets, and we want to import the one called
TOT
, the second one. But is it really the second one? I have noticed that sometimes, there are
hidden sheets which makes importing the one you want impossible. So first, let use use another
package, readxl
and its function excel_sheets()
to make sure we are extracting the sheet we
really need:
sheets <- readxl::excel_sheets(path_to_data)
tot_sheet <- which(sheets == "TOT")
print(tot_sheet)
## [1] 3
As you can see, the sheet we want is not the second, but the third! Let us import this sheet into R now (this might take more time than you think; on my computer it takes around 10 seconds):
my_sheet <- rexcel_read(path_to_data, sheet = tot_sheet)
Now we can start using jailbreakr
. The function split_sheet()
is the one that splits the sheet
into little tables:
tables <- split_sheet(my_sheet)
str(tables)
## List of 4
## $ :Classes 'worksheet_view', 'R6' <worksheet_view>
## Public:
## cells: active binding
## clone: function (deep = FALSE)
## data: NULL
## dim: 34 28
## header: NULL
## idx: list
## initialize: function (sheet, xr, filter, header, data)
## lookup: active binding
## lookup2: active binding
## merged: active binding
## sheet: worksheet, R6
## table: function (col_names = TRUE, ...)
## values: function ()
## xr: cell_limits, list
## $ :Classes 'worksheet_view', 'R6' <worksheet_view>
## Public:
## cells: active binding
## clone: function (deep = FALSE)
## data: NULL
## dim: 33 28
## header: NULL
## idx: list
## initialize: function (sheet, xr, filter, header, data)
## lookup: active binding
## lookup2: active binding
## merged: active binding
## sheet: worksheet, R6
## table: function (col_names = TRUE, ...)
## values: function ()
## xr: cell_limits, list
## $ :Classes 'worksheet_view', 'R6' <worksheet_view>
## Public:
## cells: active binding
## clone: function (deep = FALSE)
## data: NULL
## dim: 32 28
## header: NULL
## idx: list
## initialize: function (sheet, xr, filter, header, data)
## lookup: active binding
## lookup2: active binding
## merged: active binding
## sheet: worksheet, R6
## table: function (col_names = TRUE, ...)
## values: function ()
## xr: cell_limits, list
## $ :Classes 'worksheet_view', 'R6' <worksheet_view>
## Public:
## cells: active binding
## clone: function (deep = FALSE)
## data: NULL
## dim: 33 28
## header: NULL
## idx: list
## initialize: function (sheet, xr, filter, header, data)
## lookup: active binding
## lookup2: active binding
## merged: active binding
## sheet: worksheet, R6
## table: function (col_names = TRUE, ...)
## values: function ()
## xr: cell_limits, list
tables
is actually a list containing worksheet_view
objects. Take a look at the dim
attribute: you see the dimensions of the tables there. When I started using jailbreakr
I was
stuck here. I was looking for the function that would extract the data frames and could not find
it. Then I watched the video and I understood what I had to do: a worksheet_view
object has a
values()
method that does the extraction for you. This is a bit unusual in R (it made me feel
like I was using Python); maybe in future versions this values()
method will become a separate
function of its own in the package. What happens when we use values()
?
library("purrr")
list_of_data <- map(tables, (function(x)(x$values())))
map(list_of_data, head)
## [[1]]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] "TOT" NA NA NA NA NA NA NA NA NA
## [2,] "DEMAND" 33603 33969 34334 34699 35064 35430 35795 36160 36525
## [3,] "FDEMT" "FDEMN" NA NA NA NA NA NA NA NA
## [4,] "EU" ":" 16.9 -1.4 20.2 34.5 31.4 37.5 39 37.3
## [5,] "EA" ":" 15.5 -13.1 14.8 30.9 25.1 35.2 39.2 37.1
## [6,] "BE" ":" ":" ":" ":" ":" ":" ":" 42.3 43.1
## [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## [1,] NA NA NA NA NA NA NA NA NA NA NA
## [2,] 36891 37256 37621 37986 38352 38717 39082 39447 39813 40178 40543
## [3,] NA NA NA NA NA NA NA NA NA NA NA
## [4,] 39.2 27.5 20.6 21.4 29.8 26.4 32.5 47.1 19 -1.3 23.5
## [5,] 39.5 25.3 18.2 18.9 27.4 23 28.2 46.1 12.3 -9.3 19.3
## [6,] 45.8 42.2 42.9 43.8 45.8 47.4 49.1 50.9 48.2 46.9 46.3
## [,22] [,23] [,24] [,25] [,26] [,27] [,28]
## [1,] NA NA NA NA NA NA NA
## [2,] 40908 41274 41639 42004 42369 42735 43100
## [3,] NA NA NA NA NA NA NA
## [4,] 29 22 21.1 25.6 31.8 22.9 "30.7"
## [5,] 26.2 18.6 15.7 21.7 28.8 17.3 26.6
## [6,] 46.8 47.1 48.2 50.1 49.2 34.5 34.4
##
## [[2]]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] "FINANCIAL" 33603 33969 34334 34699 35064 35430 35795 36160 36525
## [2,] "FFINT" "FFINN" NA NA NA NA NA NA NA NA
## [3,] "EU" ":" -5.1 -6.2 2.7 6.7 9 14.4 13.9 14
## [4,] "EA" ":" -8.8 -13.5 -3.4 2.6 5.7 12.5 13.2 13.1
## [5,] "BE" ":" ":" ":" ":" ":" ":" ":" 21.5 22.4
## [6,] "BG" ":" ":" ":" ":" ":" ":" ":" ":" ":"
## [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## [1,] 36891 37256 37621 37986 38352 38717 39082 39447 39813 40178 40543
## [2,] NA NA NA NA NA NA NA NA NA NA NA
## [3,] 16.4 9.4 7.4 8.1 12.4 8.4 13.6 23.4 4.1 -4 10.9
## [4,] 16.5 8 6.8 5.1 9.9 4.8 8.4 24.3 -2.8 -10.5 9.3
## [5,] 20.9 22.3 32.2 33.5 33.8 34.8 35 34.5 37.2 33.5 32.7
## [6,] ":" ":" 20.8 24 27.1 28.3 33.4 37.5 37.7 26.6 30.4
## [,22] [,23] [,24] [,25] [,26] [,27] [,28]
## [1,] 40908 41274 41639 42004 42369 42735 43100
## [2,] NA NA NA NA NA NA NA
## [3,] 12.4 10.2 8.8 13.4 17.4 6.2 "12.3"
## [4,] 9 7.2 5 11 13.1 -1 6.5
## [5,] 31.5 32.3 33 31.7 32.2 19.9 20.5
## [6,] 33.8 35.6 36 41.5 41.6 44.2 43.8
##
## [[3]]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] "TECHNICAL" 33603 33969 34334 34699 35064 35430 35795 36160 36525
## [2,] "FTECT" "FTECN" NA NA NA NA NA NA NA NA
## [3,] "EU" ":" 39.2 37.6 38.3 40 40.7 42.8 43.5 43.8
## [4,] "EA" ":" 39.7 36.2 37.5 41.2 40 44 44.8 44.9
## [5,] "BE" ":" ":" ":" ":" ":" ":" ":" 58.8 58.5
## [6,] "BG" ":" ":" ":" ":" ":" ":" ":" ":" ":"
## [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## [1,] 36891 37256 37621 37986 38352 38717 39082 39447 39813 40178 40543
## [2,] NA NA NA NA NA NA NA NA NA NA NA
## [3,] 37 31.1 27.2 30.9 30.4 30.3 27.4 40.5 25.8 23.1 27.4
## [4,] 37 30.3 27.4 31 29.9 29.7 24.8 41 23.4 19.5 26.4
## [5,] 58.3 58.4 57.7 59.2 59.6 59.4 60.2 59.5 60.5 57.9 56.3
## [6,] ":" ":" 17.3 17.5 21.1 21.5 25.3 28.2 26.1 21 25.3
## [,22] [,23] [,24] [,25] [,26] [,27] [,28]
## [1,] 40908 41274 41639 42004 42369 42735 43100
## [2,] NA NA NA NA NA NA NA
## [3,] 28.9 26.3 31.3 32.1 32.1 30.2 "34.6"
## [4,] 28.5 25.9 32.1 32.4 33.1 30.2 36
## [5,] 56.7 57.7 57.9 58.6 59.1 13.1 13.1
## [6,] 24.6 26.8 30.4 31.9 34.1 34.8 33.7
##
## [[4]]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] "OTHER" 33603 33969 34334 34699 35064 35430 35795 36160 36525 36891
## [2,] "FOTHT" "FOTHN" NA NA NA NA NA NA NA NA NA
## [3,] "EU" ":" 2.9 -0.5 3.9 3.9 1 4.1 4.7 7 7.2
## [4,] "EA" ":" 2.3 -4.9 1.4 1.3 -2.4 1.1 3.2 5.8 7
## [5,] "BE" ":" ":" ":" ":" ":" ":" ":" 14 14.9 15.9
## [6,] "BG" ":" ":" ":" ":" ":" ":" ":" ":" ":" ":"
## [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
## [1,] 37256 37621 37986 38352 38717 39082 39447 39813 40178 40543 40908
## [2,] NA NA NA NA NA NA NA NA NA NA NA
## [3,] -1.5 6.2 8.1 7.6 1.4 2.4 13.7 -1.9 -3.2 1.1 1.1
## [4,] -3.7 5.5 7.1 7.2 -2.2 0.4 15.5 -4.6 -8.4 0.3 -3.3
## [5,] 16.3 22.8 23.1 22.4 24.5 25.3 25.5 26.6 26.6 24.7 24.6
## [6,] ":" -2.3 -0.8 2.4 2.9 3.5 4.8 5.5 2.2 3.3 3.2
## [,23] [,24] [,25] [,26] [,27] [,28]
## [1,] 41274 41639 42004 42369 42735 43100
## [2,] NA NA NA NA NA NA
## [3,] -1.6 0.9 2.7 1.9 -3.3 "2.1"
## [4,] -2.3 0.6 2.5 2.1 -5.4 1.7
## [5,] 26.4 25.9 25 25.3 4.7 5.2
## [6,] 5.9 7 8.2 9.6 9.4 9.1
We are getting really close to something useful! Now we can get the first table and do some basic cleaning to have a tidy dataset:
dataset1 <- list_of_data[[1]]
dataset1 <- dataset1[-c(1:3), ]
dataset1[dataset1 == ":"] <- NA
colnames(dataset1) <- c("country", seq(from = 1991, to = 2017))
head(dataset1)
## country 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002
## [1,] "EU" NA 16.9 -1.4 20.2 34.5 31.4 37.5 39 37.3 39.2 27.5 20.6
## [2,] "EA" NA 15.5 -13.1 14.8 30.9 25.1 35.2 39.2 37.1 39.5 25.3 18.2
## [3,] "BE" NA NA NA NA NA NA NA 42.3 43.1 45.8 42.2 42.9
## [4,] "BG" NA NA NA NA NA NA NA NA NA NA NA 39.6
## [5,] "CZ" NA NA NA NA NA NA NA NA NA NA NA 54.9
## [6,] "DK" 49.5 45 50 59.5 62.5 55.5 60.5 57.5 56 61.5 57.5 59.5
## 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
## [1,] 21.4 29.8 26.4 32.5 47.1 19 -1.3 23.5 29 22 21.1 25.6 31.8 22.9
## [2,] 18.9 27.4 23 28.2 46.1 12.3 -9.3 19.3 26.2 18.6 15.7 21.7 28.8 17.3
## [3,] 43.8 45.8 47.4 49.1 50.9 48.2 46.9 46.3 46.8 47.1 48.2 50.1 49.2 34.5
## [4,] 43 42.8 45.5 49.1 52.6 50.7 39.5 45.5 47.4 45.6 50.5 51.4 49.9 53.2
## [5,] 37 48.5 67.9 66.4 66.8 69.3 64.7 61 56 47.5 53 53.5 67.5 58
## [6,] 53.5 50 59 64 63 56 33.5 57 47 48 52 45.5 40.5 36.5
## 2017
## [1,] "30.7"
## [2,] 26.6
## [3,] 34.4
## [4,] 52.8
## [5,] 59.5
## [6,] 37.5
Et voilà! We went from a messy spreadsheet to a tidy dataset in a matter of minutes. Even though this package is still in early development and not all the features that are planned are available, the basics are there and can save you a lot of pain!
Thanks for visiting r-craft.org
This article is originally published at http://www.brodrigues.co/
Please visit source website for post related comments.