Performance comparison of converting list to data.frame with R language
This article is originally published at https://tomaztsql.wordpress.com
When you are working with large datasets, performance is on everyone's mind — especially when converting data from one type to another. Choosing the right method can make a huge difference.
So in this case, I will create a dummy list and convert its values into a data.frame.
A simple function creates a large list (approx. 46 MB with 250,000 elements, where each element consists of 10 measurements).
cre_l <- function(len, start, end) { return(round(runif(len, start, end), 8)) }
myl2 <- list()
# 250,000 elements is approx. 46 MB in size
# 2,500 elements for the demo
for (i in 1:2500) { myl2[[i]] <- cre_l(10, 0, 50) }
The list will be transformed into a data.frame with 10 (ten) variables and a number of observations corresponding to the length of the list. To give you some perspective, this code does exactly that:
for (i in 1:2500) { myl2[[i]] <- cre_l(10, 0, 50) }
df <- data.frame(do.call(rbind, myl2))
And you end up with a data.frame instead of a list.
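As a quick sanity check (a minimal sketch of my own, not from the original post), a small list of three 10-value vectors should yield a 3-by-10 data.frame:

```r
# build a tiny list of three 10-element numeric vectors
small_list <- lapply(1:3, function(i) round(runif(10, 0, 50), 8))

# convert: each list element becomes one row of the data.frame
small_df <- data.frame(do.call(rbind, small_list))

dim(small_df)  # 3 rows, 10 columns (X1 .. X10)
```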
There are many ways to convert a list to a data.frame, but the choice becomes important when your list is a larger object. I have written down 8 ways to do the conversion (and I know there are at least 20 more).
By far the fastest methods were the do.call and sapply approaches, both outperforming all the others with the following snippets:
# do.call
data.frame(do.call(rbind, myl2))

# sapply
data.frame(t(sapply(myl2, c)))
Both methods were consistent with larger list conversions.
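It is also worth confirming that the two fast methods agree on the result. A quick check (my addition, not part of the original post) on a smaller list:

```r
# a smaller dummy list, same shape as in the post
myl2 <- lapply(1:100, function(i) round(runif(10, 0, 50), 8))

# the two fastest conversion methods
sol_do_call <- data.frame(do.call(rbind, myl2))
sol_sapply  <- data.frame(t(sapply(myl2, c)))

# same values and same X1..X10 column names
all.equal(sol_do_call, sol_sapply)
```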
And the worst were the for loop, Reduce, and as.data.frame solutions. No surprises here — just note that the for loop performed so poorly because it constantly row-binds to an existing data.frame, copying the whole accumulated object on every iteration.
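If you do need a loop, pre-allocating the result instead of growing it with rbind() removes most of that penalty. A sketch of the pre-allocated variant (my addition, assuming all list elements have equal length):

```r
# dummy list, same shape as in the post
myl2 <- lapply(1:2500, function(i) round(runif(10, 0, 50), 8))

# pre-allocate a matrix once, fill it row by row, convert at the end
m <- matrix(NA_real_, nrow = length(myl2), ncol = length(myl2[[1]]))
for (i in seq_along(myl2)) {
  m[i, ] <- myl2[[i]]
}
sol_prealloc <- data.frame(m)
```

This keeps the loop but avoids the repeated copying, since the matrix is allocated a single time.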
Complete comparison and graph code:
library(data.table)
library(plyr)
library(ggplot2)
res <- summary(microbenchmark::microbenchmark(
do_call_solution = {
sol1 <- NULL
sol1 <- data.frame(do.call(rbind, myl2))
},
for_loop_solution = {
sol2 <- NULL
for (i in seq_along(myl2)){ sol2 <- rbind(sol2, data.frame(t(unlist(myl2[i])))) }
},
ldply_to_df = {
sol3 <- NULL
sol3 <- ldply(myl2)
},
ldply_to_c = {
sol4 <- NULL
sol4 <- ldply(myl2, c)
},
sapply = {
sol5 <- NULL
sol5 <- data.frame(t(sapply(myl2,c)))
},
reduce = {
sol6 <- NULL
sol6 <- data.frame(Reduce(rbind, myl2))
},
data_table_rbindlist = {
sol7 <- NULL
sol7 <- data.frame(t(rbindlist(list(myl2))))
},
as_data_frame = {
sol8 <- NULL
sol8 <- data.frame(t(as.data.frame(myl2)))
},
times = 10L))
# producing graph
ggplot(res, aes(x = expr, y = mean/1000)) +
geom_bar(stat = "identity", fill = "lightblue") +
coord_flip() +
labs(title = "Performance comparison", subtitle = "Converting a list with 2,500 elements to data.frame") +
xlab("Method") + ylab("Conversion time (s)") +
theme_light() +
geom_text(aes(label = round(mean/1000, 3)))
I have also removed the slowest-performing conversions, created a 250,000-element list, and compared only the fastest methods over 10 consecutive runs (using the microbenchmark library), looking at mean values.
So, using for loops is super slow, while do.call with rbind or sapply will reliably deliver the best performance.
As always, the code is available on GitHub in the Useless_R_function repository.
Happy R-coding!