Another Benchmark for Joining Two Data Frames
This article is originally published at https://statcompute.wordpress.com
In my post yesterday comparing the efficiency of joining two data frames, I overlooked the computing cost of converting data.frames to data.table / ff objects. Today, I ran the test again, this time counting library loading and data conversion as part of each method. With 10 replications in the rbenchmark package, the data.table join is roughly 6 to 9 times faster than the other three methods in terms of user time. Although the ff package is claimed to handle large data, its efficiency seems questionable: it was the slowest of the four methods in this test.
n <- 1000000
set.seed(2013)
ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))

library(rbenchmark)
benchmark(replications = 10, order = "user.self",
  # GENERIC MERGE() IN BASE PACKAGE
  merge = merge(ldf, rdf, by = c("id1", "id2")),
  # DATA.TABLE PACKAGE
  datatable = {
    ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
    rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
    merge(ldt, rdt, by = c("id1", "id2"))
  },
  # FF PACKAGE
  ff = {
    lff <- ff::as.ffdf(ldf)
    rff <- ff::as.ffdf(rdf)
    merge(lff, rff, by = c("id1", "id2"))
  },
  # SQLDF PACKAGE
  sqldf = sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
                         "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2"))
)

#        test replications elapsed relative user.self sys.self user.child
# 2 datatable           10  17.923    1.000    16.605    1.432          0
# 4     sqldf           10 105.002    5.859   102.294    3.345          0
# 1     merge           10 131.279    7.325   119.139   13.049          0
# 3        ff           10 187.150   10.442   154.670   33.758          0
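Since the point of this rerun is to count conversion cost, it can help to see how the data.table total splits between conversion and the join itself. The following is only a rough sketch, not part of the original benchmark: it assumes a smaller n (100,000) so that it runs quickly, uses base system.time() instead of rbenchmark, and the names t_merge, t_convert and t_join are introduced here purely for illustration. The data.table part is skipped if the package is not installed.

```r
# Smaller n than in the benchmark above, so the sketch runs quickly (assumption).
n <- 100000
set.seed(2013)
ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE),
                  x1 = rnorm(n), x2 = runif(n))
rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE),
                  y1 = rnorm(n), y2 = runif(n))

# Baseline: base merge() works on data.frames directly, so no conversion step.
t_merge <- system.time(m_base <- merge(ldf, rdf, by = c("id1", "id2")))

if (requireNamespace("data.table", quietly = TRUE)) {
  # Step 1: conversion to data.table, including the sort implied by key =.
  t_convert <- system.time({
    ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
    rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
  })

  # Step 2: the join alone, on tables that are already keyed.
  t_join <- system.time(m_dt <- merge(ldt, rdt, by = c("id1", "id2")))

  # One row per step; columns are the standard system.time() fields.
  timings <- rbind(merge = t_merge, dt_convert = t_convert, dt_join = t_join)
  print(timings[, c("user.self", "sys.self", "elapsed")])

  # Both methods perform an inner join, so the row counts should agree.
  stopifnot(nrow(m_base) == nrow(m_dt))
} else {
  message("data.table is not installed; only the base merge() timing is available.")
}
```

A single system.time() call is much noisier than 10 replications, so the split should be read as indicative only.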