H2O Benchmark for CSV Import
This article was originally published at https://statcompute.wordpress.com
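The h2o.importFile() function in H2O is very efficient because it reads the file in parallel. The benchmark below, run with five replications on the nycflights13 CSV file, shows that it is comparable to read.df() in SparkR, which took about 1.2 times as long, and significantly faster than the generic read.csv(), which took about twice as long.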
library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local",
                     sparkConfig = list(spark.driver.memory = "10g",
                                        spark.driver.cores = "4"))

library(h2o)
h2o.init(max_mem_size = "10g")

library(rbenchmark)
benchmark(replications = 5, order = "elapsed", relative = "elapsed",
  # base R reader
  csv = {
    df <- read.csv("Documents/nycflights13.csv")
    print(nrow(df))
    rm(df)
  },
  # SparkR reader
  spk = {
    df <- read.df("Documents/nycflights13.csv", source = "csv",
                  header = "true", inferSchema = "true")
    print(nrow(df))
    rm(df)
  },
  # H2O reader
  h2o = {
    df <- h2o.importFile(path = "Documents/nycflights13.csv",
                         header = TRUE, sep = ",")
    print(nrow(df))
    rm(df)
  }
)
#   test replications elapsed relative user.self sys.self user.child sys.child
# 3  h2o            5   8.221    1.000     0.508    0.032          0         0
# 2  spk            5   9.822    1.195     0.008    0.004          0         0
# 1  csv            5  16.595    2.019    16.420    0.176          0         0
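To make the output easier to read, the short sketch below redoes the arithmetic behind the relative column using only the elapsed totals reported above. It assumes that rbenchmark computes relative as the ratio of each test's elapsed time to the fastest one, which is consistent with the printed values. The names res and per_run are introduced here for illustration and are not part of the original benchmark.

```r
# Elapsed totals copied from the benchmark output above (5 replications each).
res <- data.frame(
  test = c("h2o", "spk", "csv"),
  replications = 5,
  elapsed = c(8.221, 9.822, 16.595)
)

# Average seconds per single import (per_run is a name made up for this sketch).
res$per_run <- res$elapsed / res$replications

# Ratio of each elapsed total to the fastest one, rounded to three decimals.
res$relative <- round(res$elapsed / min(res$elapsed), 3)

print(res)
```

Read this way, a single import took roughly 1.6 seconds with H2O, 2.0 seconds with SparkR, and 3.3 seconds with read.csv() on this machine. These are wall-clock totals from one run of five replications, so small differences such as the one between H2O and SparkR should be treated as indicative only.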