W is for Write and Read Data – Fast
This article was originally published at http://www.deeplytrivial.com/
Once again, I'm dipping outside of the tidyverse, but this package and its functions have been really useful for getting data into (and out of) R quickly.
For work, I have to pull in data from a few different sources, then manipulate and combine them into the final dataset that I use for much of my analysis. So that I don't have to go through all of that joining, recoding, and calculating each time, I saved the final merged dataset as a CSV file that I can load whenever I need to continue my analysis. The problem is that the most recent version of that file, which contains 13 million+ records, was so large that writing it (and reading it back in later) took forever and sometimes timed out.
That's when I discovered the data.table library and its fread and fwrite functions. The tidyverse is great for working with CSV files, but a lot of the memory and loading time is used for formatting. fread and fwrite are leaner and get the job done a bit faster. For regular-sized CSV files (like my reads2019 set), the time difference is pretty minimal. But for a 5 GB data file, it makes a huge difference.
library(tidyverse)
system.time(reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",
                                  col_names = TRUE))
## user system elapsed
## 0.00 0.10 0.14
rm(reads2019)
library(data.table)
system.time(reads2019 <- fread("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv"))
## user system elapsed
## 0 0 0
read_csv:
   user  system elapsed
  61.14   11.72   90.56
fread:
   user  system elapsed
  57.97   16.40   57.19
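If you don't have a multi-gigabyte file handy, here is a minimal sketch of the same kind of read comparison on a synthetic file. The names demo, tmp, and n are hypothetical, the data are randomly generated, and I've swapped in base R's read.csv as the comparison so the only package needed is data.table. Timings will vary by machine, so treat it as a starting point and not as a benchmark.

```r
library(data.table)

set.seed(42)
n <- 1e5  # hypothetical size; increase it to make the gap more visible

# Synthetic data standing in for a real CSV (not the author's file)
demo <- data.frame(
  id    = seq_len(n),
  group = sample(letters[1:5], n, replace = TRUE),
  score = round(rnorm(n, mean = 100, sd = 15), 2),
  stringsAsFactors = FALSE
)

# Write once to a temporary file so both readers see identical input
tmp <- tempfile(fileext = ".csv")
fwrite(demo, tmp)

# Time base R's reader and data.table's reader on the same file
t_base  <- system.time(from_base  <- read.csv(tmp, stringsAsFactors = FALSE))
t_fread <- system.time(from_fread <- fread(tmp))

# Elapsed seconds for each reader
print(c(read.csv = t_base[["elapsed"]], fread = t_fread[["elapsed"]]))

# fread returns a data.table, so convert before comparing contents
same <- isTRUE(all.equal(from_base, as.data.frame(from_fread)))
print(same)

unlink(tmp)  # clean up the temporary file
```

Note that fread hands back a data.table, not a tibble or plain data frame; wrap the result in as.data.frame() (or as_tibble() if you are staying in the tidyverse) when downstream code expects one of those.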
But the real win is in how quickly this package writes CSV data. Using a package called wakefield, I'll randomly generate 10,000,000 records of survey data, then see how long it takes to write the data to file using both write_csv and fwrite.
library(wakefield)
## Warning: package 'wakefield' was built under R version 3.6.3
set.seed(42)
reallybigshew <- r_data_frame(n = 10000000,
                              id,
                              race,
                              age,
                              smokes,
                              marital,
                              Start = hour,
                              End = hour,
                              iq,
                              height,
                              died)
system.time(write_csv(reallybigshew, "~/Downloads/Blogging A to Z/bigdata1.csv"))
## user system elapsed
## 134.22 2.52 137.80
system.time(fwrite(reallybigshew, "~/Downloads/Blogging A to Z/bigdata2.csv"))
## user system elapsed
## 8.65 0.32 2.77
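In the fwrite timing above, user time (8.65) is well above elapsed time (2.77), which is what you'd expect when the work is spread across several threads; fwrite can write in parallel. Here is a minimal sketch for checking that on your own machine, using a small made-up data.table in place of the wakefield data. The names survey, out_one, and out_all are hypothetical. It writes the same table with one thread and with data.table's current thread count, then confirms that both files match and that the data survive a round trip through fread.

```r
library(data.table)

set.seed(42)
n <- 2e5  # hypothetical size, far smaller than the 10 million rows above

# Made-up survey-style data (not generated by wakefield)
survey <- data.table(
  ID     = seq_len(n),
  Age    = sample(18:89, n, replace = TRUE),
  Smokes = sample(c(TRUE, FALSE), n, replace = TRUE),
  IQ     = as.integer(round(rnorm(n, mean = 100, sd = 15)))
)

out_one <- tempfile(fileext = ".csv")
out_all <- tempfile(fileext = ".csv")

# Number of threads data.table is currently set to use
threads <- getDTthreads()

# Same write, once single-threaded and once with all configured threads
t_one <- system.time(fwrite(survey, out_one, nThread = 1))
t_all <- system.time(fwrite(survey, out_all, nThread = threads))

print(c(threads = threads,
        one_thread  = t_one[["elapsed"]],
        all_threads = t_all[["elapsed"]]))

# The thread count should not change what ends up in the file
identical_files <- identical(readLines(out_one), readLines(out_all))

# Reading the file back should reproduce the original table
back <- fread(out_all)
round_trip_ok <- isTRUE(all.equal(survey, back))

print(c(identical_files = identical_files, round_trip_ok = round_trip_ok))
```

On a table this small the two timings may be nearly identical; the difference should become clearer as n grows toward the sizes used in this post.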