Another Benchmark for Joining Two Data Frames

by statcompute · January 30, 2013

This article was originally published at https://statcompute.wordpress.com

In yesterday's post comparing the efficiency of joining two data frames, I overlooked the cost of converting data.frames to data.table / ff objects. Today I reran the test, this time counting library loading and data conversion inside each timed expression. Over 10 replications with the rbenchmark package, the data.table join is still the fastest by a wide margin: roughly 6 to 9 times faster than the other methods in terms of user time. Although the ff package is said to handle large data, its efficiency in this test seems questionable.

n <- 1000000
set.seed(2013)
ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))

library(rbenchmark)
benchmark(replications = 10, order = "user.self",
  # GENERIC MERGE() IN BASE PACKAGE
  merge = merge(ldf, rdf, by = c("id1", "id2")),
  # DATA.TABLE PACKAGE
  datatable = {
    ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
    rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
    merge(ldt, rdt, by = c("id1", "id2"))
  },
  # FF PACKAGE
  ff = {
    lff <- ff::as.ffdf(ldf)
    rff <- ff::as.ffdf(rdf)
    merge(lff, rff, by = c("id1", "id2"))
  },
  # SQLDF PACKAGE
  sqldf = sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
                         "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2"))
)

#        test replications elapsed relative user.self sys.self user.child
# 2 datatable           10  17.923    1.000    16.605    1.432          0
# 4     sqldf           10 105.002    5.859   102.294    3.345          0
# 1     merge           10 131.279    7.325   119.139   13.049          0
# 3        ff           10 187.150   10.442   154.670   33.758          0
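
As a quick check on these figures, the ratios against the fastest method can be recomputed from the printed output. This is only a minimal sketch in base R: the `timings` data frame and the `user.ratio` / `elapsed.ratio` column names are made up here for illustration and are not produced by rbenchmark itself.

```r
# Timings copied by hand from the benchmark output above (10 replications each).
timings <- data.frame(
  test      = c("datatable", "sqldf", "merge", "ff"),
  elapsed   = c(17.923, 105.002, 131.279, 187.150),
  user.self = c(16.605, 102.294, 119.139, 154.670),
  stringsAsFactors = FALSE
)

# Ratio of each method's time to the fastest method (hypothetical column names).
timings$user.ratio    <- round(timings$user.self / min(timings$user.self), 2)
timings$elapsed.ratio <- round(timings$elapsed / min(timings$elapsed), 2)

# Show the methods from fastest to slowest by user time.
print(timings[order(timings$user.self), ])
```

The elapsed ratios should agree with the `relative` column that rbenchmark reports, while the user-time ratios are the ones the comparison in the text refers to.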
