Read a lot of datasets at once with R
This article is originally published at http://www.brodrigues.co/
I often have to read a lot of datasets at once using R, so I’ve written the following function to do it:
read_list <- function(list_of_datasets, read_func){
  # Read every file with read_func; simplify = FALSE keeps the result a list
  output <- sapply(list_of_datasets, read_func,
                   simplify = FALSE, USE.NAMES = TRUE)
  # Remove the directory and the extension from the data set names
  names(output) <- tools::file_path_sans_ext(basename(list_of_datasets))
  return(output)
}
You need to supply the names of the dataset files (as a character vector) as well as the function used to read them to read_list(). So for example, to read in .csv files, you could use read.csv() (or read_csv() from the readr package, which I prefer to use), or read_dta() from the haven package for Stata files, and so on.
Now imagine you have some data in your working directory. Start by saving the names of the datasets in a variable:
data_files <- list.files(pattern = "\\.csv$")
print(data_files)
## [1] "data_1.csv" "data_2.csv" "data_3.csv"
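A small aside on the pattern argument: list.files() treats it as a regular expression, so an unescaped dot matches any character. The sketch below illustrates the difference in a scratch directory; the directory and file names are hypothetical and exist only for this demonstration.

```r
# Hypothetical scratch directory and file names, used only for illustration
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("data_1.csv", "data_2.csv", "xcsv_notes.txt")))

# Unanchored pattern: "." matches any character, so the .txt file slips in
loose <- list.files(demo_dir, pattern = ".csv")

# Escaped dot plus "$" anchor: only names that end in ".csv"
strict <- list.files(demo_dir, pattern = "\\.csv$")

# full.names = TRUE returns paths that can be read from any working directory
strict_paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

print(loose)
print(strict)
```

If you use full.names = TRUE, keep in mind that the returned strings contain the directory, so any names derived from them will too unless you strip it with basename().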
Now you can read all the data sets and save them in a list with read_list():
library("readr")
library("tibble")
list_of_data_sets <- read_list(data_files, read_csv)
glimpse(list_of_data_sets)
## List of 3
## $ data_1:Classes 'tbl_df', 'tbl' and 'data.frame': 19 obs. of 3 variables:
## ..$ col1: chr [1:19] "0,018930679" "0,8748013128" "0,1025635934" "0,6246140983" ...
## ..$ col2: chr [1:19] "0,0377725807" "0,5959457638" "0,4429121533" "0,558387159" ...
## ..$ col3: chr [1:19] "0,6241767189" "0,031324594" "0,2238059868" "0,2773350732" ...
## $ data_2:Classes 'tbl_df', 'tbl' and 'data.frame': 19 obs. of 3 variables:
## ..$ col1: chr [1:19] "0,9098418493" "0,1127788509" "0,5818891392" "0,1011773532" ...
## ..$ col2: chr [1:19] "0,7455905887" "0,4015039612" "0,6625796605" "0,029955339" ...
## ..$ col3: chr [1:19] "0,327232932" "0,2784035673" "0,8092386735" "0,1216045306" ...
## $ data_3:Classes 'tbl_df', 'tbl' and 'data.frame': 19 obs. of 3 variables:
## ..$ col1: chr [1:19] "0,9236124896" "0,6303271761" "0,6413583054" "0,5573887416" ...
## ..$ col2: chr [1:19] "0,2114708388" "0,6984538266" "0,0469865249" "0,9271510226" ...
## ..$ col3: chr [1:19] "0,4941919971" "0,7391538511" "0,3876723797" "0,2815014394" ...
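In the output above every column is of type chr, because the values use a comma as the decimal mark. One possible way to handle this, assuming the reader has an option for the decimal mark, is to pass a small wrapper as read_func. The sketch below uses base R's read.csv() and its dec argument on a hypothetical temporary file; the file contents and the name read_csv_dec_comma are made up for illustration. With readr, the corresponding option is locale = locale(decimal_mark = ",").

```r
# Hypothetical file that mimics the data above: quoted values with decimal commas
demo_file <- tempfile(fileext = ".csv")
writeLines(c('"col1","col2"', '"0,5","1,25"', '"0,75","2,5"'), demo_file)

# Default reader: the values cannot be parsed as numbers and stay as text
as_text <- read.csv(demo_file, stringsAsFactors = FALSE)

# Wrapper with the decimal mark fixed; it takes only the file name,
# so it can be passed wherever a read_func is expected
read_csv_dec_comma <- function(file) {
  read.csv(file, dec = ",", stringsAsFactors = FALSE)
}
as_numbers <- read_csv_dec_comma(demo_file)

str(as_text)
str(as_numbers)
```

The wrapper could then be used like any other reader, for instance read_list(data_files, read_csv_dec_comma).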
If you prefer not to have the datasets in a list, but rather import them into the global environment, you can change the above function like so:
read_list <- function(list_of_datasets, read_func){
  read_and_assign <- function(dataset, read_func){
    assign(dataset, read_func(dataset), envir = .GlobalEnv)
  }
  # invisible is used to suppress the unneeded output
  output <- invisible(
    sapply(list_of_datasets,
           read_and_assign, read_func = read_func, simplify = FALSE, USE.NAMES = TRUE))
}
I personally don’t like this second option, but I include it here for completeness.
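If you already have the datasets in a named list, base R's list2env() is another way to copy the elements into an environment without rewriting the reading function. This is only a sketch: the list below is a hypothetical stand-in for the output of a list-returning reader, and it writes into a fresh environment instead of the global one so that the demonstration leaves the workspace untouched.

```r
# Hypothetical named list standing in for the output of a list-returning reader
list_of_data_sets <- list(
  data_1 = data.frame(x = 1:3),
  data_2 = data.frame(x = 4:6)
)

# A dedicated environment keeps the demo from touching the workspace;
# use envir = .GlobalEnv to create the objects in the global environment instead
target_env <- new.env()
list2env(list_of_data_sets, envir = target_env)

# Each list element is now an object named after its list name
print(ls(target_env))
```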