Efficiency in Joining Two Data Frames
In R, there are multiple ways to merge two data frames. However, there can be a huge disparity in efficiency among them. It is therefore worthwhile to benchmark the different methods and choose the right approach for real-world work.
For smaller data frames with 1,000 rows each, all six methods shown below work pretty well, except that the approach with sql.select() is significantly slower than the rest. The generic merge() function in the base package is a very natural choice, with no overhead of loading additional libraries or converting data frames. sqldf() is also attractive in that it might be the most user-friendly option, with a very intuitive syntax.
> n <- 1000
> set.seed(2013)
> ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
> rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))
>
> # METHOD 1: MERGE
> system.time(join1 <- merge(ldf, rdf, by = c("id1", "id2")))
   user  system elapsed
  0.032   0.012   0.064
>
> # METHOD 2: DATA.TABLE
> ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
> rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
> system.time(join2 <- merge(ldt, rdt, by = c("id1", "id2")))
   user  system elapsed
  0.028   0.000   0.044
>
> # METHOD 3: FF
> lff <- ff::as.ffdf(ldf)
> rff <- ff::as.ffdf(rdf)
> system.time(join3 <- merge(lff, rff, by = c("id1", "id2")))
   user  system elapsed
  0.044   0.004   0.096
>
> # METHOD 4: SQLDF
> system.time(join4 <- sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
+   "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2")))
   user  system elapsed
  0.168   0.008   0.332
>
> # METHOD 5: PLYR
> system.time(join5 <- plyr::join(ldf, rdf, by = c("id1", "id2"), type = "inner"))
   user  system elapsed
  0.088   0.020   0.152
>
> # METHOD 6: SQL.SELECT
> source("http://sqlselect.googlecode.com/svn/trunk/sql.select.R")
> system.time(join6 <- sql.select("select * from ldf inner join rdf on (`ldf$id1 == rdf$id1 & ldf$id2 == rdf$id2`)"))
   user  system elapsed
 53.775  19.725  73.813
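Before comparing timings, it is also worth confirming that the different methods return the same rows. The snippet below is a minimal sketch of such a sanity check, assuming join1 and join2 from the code above are still in the workspace; both results are sorted by the join keys first, since the methods do not guarantee the same row order.

# sanity check (sketch): sort both results by the join keys, drop row names,
# and compare the values column by column
chk1 <- join1[order(join1$id1, join1$id2), ]
chk2 <- as.data.frame(join2)[order(join2$id1, join2$id2), ]
rownames(chk1) <- rownames(chk2) <- NULL
all.equal(chk1, chk2, check.attributes = FALSE)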
However, when it comes to mid-size data frames with 1,000,000 rows, the story changes. First of all, out of the six methods shown above, the last two fail outright due to insufficient memory on my 32-bit Ubuntu virtual machine. In this case, the data.table package shows a significant advantage once the two data frames are converted to data.tables (see the short sketch after the timings below). In addition, it is interesting that, although the ff and sqldf packages are slower than the merge() function for the small data with 1,000 rows, both of them are slightly faster for the data with 1,000,000 rows.
> n <- 1000 ^ 2
> set.seed(2013)
> ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
> rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))
>
> # METHOD 1: MERGE
> system.time(join1 <- merge(ldf, rdf, by = c("id1", "id2")))
   user  system elapsed
 55.223  12.437  68.054
>
> # METHOD 2: DATA.TABLE
> ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
> rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
> system.time(join2 <- merge(ldt, rdt, by = c("id1", "id2")))
   user  system elapsed
  0.484   0.008   0.492
>
> # METHOD 3: FF
> lff <- ff::as.ffdf(ldf)
> rff <- ff::as.ffdf(rdf)
> system.time(join3 <- merge(lff, rff, by = c("id1", "id2")))
   user  system elapsed
 49.811  13.821  64.004
>
> # METHOD 4: SQLDF
> system.time(join4 <- sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
+   "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2")))
   user  system elapsed
 40.418   1.268  42.076
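Much of data.table's advantage here comes from the key set at conversion time: with both tables keyed on id1 and id2, the join can locate matching rows by binary search instead of scanning or re-sorting the whole tables. As a side note, the same inner join can also be written with data.table's bracket syntax instead of merge(). The line below is a minimal sketch using the keyed ldt and rdt created above; nomatch = 0 drops non-matching rows, turning the default outer-style lookup into an inner join.

# equivalent inner join via the keyed bracket syntax (sketch, not timed above)
join2b <- ldt[rdt, nomatch = 0]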