R / R News / Rstats

Speeding up the Cluster Bootstrap in R

by diffuseprior · December 11, 2015

This article is originally published at https://diffuseprior.wordpress.com

Back in January 2013 I wrote a blog post showing how to implement a basic cluster/block bootstrap in R. One drawback of the cluster bootstap is the length of time it takes to sample with replacement and create the data samples. Thankfully some of the comments on my previous post illustrated simple ways to get speed gains. However, even with these gains this procedure is extremely time consuming.

I have been using the cluster bootstrap in some of my research and have found another way to speed things up—use parallel processing power. I appreciate I might be somewhat late to the multicore functions, but hopefully somebody has been having a similar issue as me can take solace from this post.

In the code below I demonstrate how the function “clusterApply” from the package “snow” can be used as a replacement for the regular “apply” function. Note the cluster in clusterApply refers to the mulitcore clusters rather than the clusters in the data frame. My code sets up a simple regression problem, wherein the standard error of the the regressor is 0.4. To demonstrate the clustering phenomenon I duplicate the data frame of 10,000 observations 20 times. As a result of this the standard error falls to 0.09 based on the naive estimate of the variance-covariance matrix.

The clustering problem can easily be corrected using the “felm” function from (what I consider the best R package) “lfe”. However, there are many occasions where researchers might want to use econometric techniques that do not lend themselves to a simple variance-covariance correction like the OLS or 2SLS estimators. These are the situations where you wan to use the cluster bootstrap.

The code below demonstrates how this can be done with and without using parallel processing. The only difference is that the parallel processing requires the user to set the number of clusters (again not clusters in the data!) and use clusterApply instead of apply. In this application, using parallel processing reduces the cluster bootstrap time down from 5 mins 42 seconds to 4 mins 6 seconds. This might seem reasonably trivial however in this simple application I am using a relatively small number of observations (10,000). The parallel processing method will get relatively quicker the larger the number of observations. Also, you can increase this by having a computer with more cores.

I appreciate any comments or criticism people might have on the code below. If anybody can think of a way that would help me to speed this up even more I would be delighted to hear it.

# cluster bootstrap with paralell processing
rm(list=ls())

# packages for cluster standard errors
library(lmtest)
library(lfe)
# use multicore functions
library(snow)

# set up simulation
n <- 10000 # number of observations
x <- rnorm(n)
y <- 5 + 2*x + rnorm(n, 0, 40)
# regression
m1 <- lm(y ~ x)
summary(m1)
# standard error is 0.4

# duplicate data
dpt <- 20 # dpt times
dat <- data.frame(x = rep(x, dpt) , y = rep(y, dpt), g = rep(1:n, dpt))

# regressions with no clustering
m2 <- lm(y ~ x, data = dat) # smaller StErrs
summary(m2)
# standard errors are like m1 = 0.09

# now cluster
summary(felm(y ~ x | 0 | 0 | g, data = dat))
# standard errors are like m1 = 0.4

# lets do this with a regular cluster bootstap
reps <- 50 # 50 reps in practice do more
clusters <- unique(dat$g)
boot.res1 <- matrix(NA, nrow = reps, ncol = 1)

# open time stamp
t1 <- Sys.time()
# set the seed 
set.seed(12345)
# do in loop
for(i in 1:reps){
  # sample the clusters with replacement
  units <- sample(clusters, size = length(clusters), replace=T)
  # create bootstap sample with sapply
  df.bs <- sapply(units, function(x) which(dat[,"g"]==x))
  df.bs <- dat[unlist(df.bs),]
  boot.res1[i] <- coef(lm(y ~ x, data = df.bs))[2]
}
# close time stamp
t2 <- Sys.time()
t2 - t1 

sd(boot.res1) # good bootstrap standard errors are = 0.4

# now lets speed up the sapply function from the previous example

boot.res2 <- matrix(NA, nrow = reps, ncol = 1)

# set the seed 
set.seed(12345)

cl <- makeCluster(10)
# open time stamp
t3 <- Sys.time()
# do in loop
for(i in 1:reps){
  # sample the clusters with replacement
  units <- sample(clusters, size = length(clusters), replace = T)
  # now use the 10 cores instead of 1!
  clusterExport(cl, c("dat", "units"))
  # cluster apply instead of regular apply
  df.bs = clusterApply(cl, units, function(x) which(dat$g == x))
  df.bs <- dat[unlist(df.bs),]
  boot.res2[i] <- coef(lm(y ~ x, data = df.bs))[2]
}
# close time stamp
t4 <- Sys.time()
t4 - t3 

stopCluster(cl)

sd(boot.res2) # good bootstrap standard errors are = 0.4

Thanks for visiting r-craft.org
This article is originally published at https://diffuseprior.wordpress.com
Please visit source website for post related comments.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Speeding up the Cluster Bootstrap in R

You may also like...

Categories

Speeding up the Cluster Bootstrap in R

You may also like...

R Weekly 2021-W04 blogdown, pagedreport, rOpenSci

Mapping a list of functions to a list of datasets with a list of columns as arguments

Dec 2020: “Top 40” New CRAN Packages

Categories