## Monday, December 2, 2013

### Speeding up model bootstrapping in GNU R

After my last post I have recurringly received two questions: (a) is it worthwhile to analyze GNU R speed in simulations and (b) how would simulation speed compare between GNU R and Python. In this post I want to address the former question and next time I am going to tackle the latter.

An area in which I use simulation in GNU R on a daily basis is bootstrapping. Therefore I have decided to check out how much speedup one can expect with tuning of standard model estimation procedures.

I started with linear regression: a simple model with two explanatory variables and 100 observations. I wanted to compare speed of generation of 10000 bootstrap replications using standard lm function and bare matrix calculus formula for b solving the equation: X'Xb=X'y. Here is the simulation:

set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- x1 + x2 + 1 + rexp(n) # add non-normal error distribution
data.m <- cbind("(Intercept)" = 1, x1, x2, y)
data.f <- data.frame(x1, x2, y)

boot.bare <- function() {
dm <- data.m[sample.int(n, replace = T),]
X <- dm[, 1:3]
tX <- t(X)
y <- dm[, 4]
solve(tX %*% X, tX %*% y)
}

boot.lm <- function() {
lm(y~., data=data.f[sample.int(n, replace = T),])\$coef
}

set.seed(2)
system.time(rb <- replicate(10000, boot.bare())[, 1,])
#  0.73 seconds

set.seed(2)
system.time(rl <- replicate(10000, boot.lm()))
# 16.00 seconds

# check if both procedures produce the same results
range(rb - rl)
# OK difference ranges from -5.373479e-14 to 4.751755e-14

Both formulas yield practically identical results but using lm produces over 20 times slower execution. Of course lm does much more work than simple calculation of parameter estimates - boot for bootstrapping I need only the parameters.

So next I thought to check out a more complex model. Therefore I switched to logistic regression.
I compared glm estimation with two alternatives: brute-force optimization of log-likelihood using nlm and direct computation of estimates using iteratively reweighted least squares (IRLS, the method actually used by glm internally). Here is the code:

set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- x1 + x2 + runif(n) < 1.5 # add non-logistic error distribution
data.m <- cbind(1, x1, x2, y)
data.f <- data.frame(x1, x2, y)

boot.nlm <- function() {
llik <- function(a) {
prob <- 1 / (1 + exp(-X %*% a))
prob[y == F] <- 1 - prob[y == F]
-sum(log(prob))
}

dm <- data.m[sample.int(n, replace = T),]
X <- dm[, 1:3]
tX <- t(X)
y <- dm[, 4]
nlm(llik, c(1, 2, 3))\$estimate
}

boot.irls <- function() {
dm <- data.m[sample.int(n, replace = T),]
X <- dm[, 1:3]
tX <- t(X)
y <- dm[, 4]
b <- rep(0, 3)
while(TRUE) {
prob <- 1 / (1 + exp(-X %*% b))
pp <- as.vector(prob * (1 - prob))
pp <- rbind(pp, pp, pp)
grad <- tX %*% (y - prob)
b <- b + solve((tX * pp) %*% X, grad)
return(b)
}
}
}

boot.glm <- function() {
x <- data.f[sample.int(n, replace = T),]
glm(y~., data=x, family = "binomial")\$coef
}

set.seed(2)
system.time(rn <- replicate(10000, boot.nlm()))
# 35.48 seconds
# warnings related to procedure convergence produced

set.seed(2)
system.time(ri <- replicate(10000, boot.irls())[, 1,])
# 5.49 seconds

set.seed(2)
system.time(rg <- replicate(10000, boot.glm()))
# 34.68 seconds

# check if all procedures produce the same results
range(rn - ri) # -0.0002572192  0.0002470589
range(rn - rg) # -0.0002572191  0.0002470588
range(ri - rg) # -4.196647e-07  3.602730e-07

We can see that using nlm is the slowest and produces unstable results and that IRLS and glm give identical results but IRLS is 6 times faster.

What are the conclusions? If you really need it is is possible to substantially reduce computational time even for "core" functions.
However, simplest means are not always guaranteed to be efficient (like nlm in logistic regression) and there is a question if the speedup pays of, as there is a substantial additional development effort needed.

1. Note that lm() actually calls lm.fit(), so without all the wrapper code, we can use lm.fit and it is much faster:

boot.lm.fit <- function() {
dm <- data.m[sample.int(n, replace = T),]
X <- dm[, 1:3]
tX <- t(X)
y <- dm[, 4]
lm.fit(x=X, y=y)\$coef
}

Now what is lm.fit actually doing? it's a wrapper for c-code to solve the least squares problem with qr decomposition, so why not just do that ourselves? this function is the fastest of all four options. the solutions should be the same, check it yourself

boot.qrls <- function() {
dm <- data.m[sample.int(n, replace = T),]
X <- dm[, 1:3]
tX <- t(X)
y <- dm[, 4]
tol = 1e-07
.Call(stats:::C_Cdqrls, X, y, tol)\$coefficients
}

1. the "tX <- t(X)" lines can be deleted from both functions, which leads to a further improvement. missed that

2. Here's an improved version of boot.bare(). it takes advantage of R's built in functions for matrix algebra:

boot.bare2 <- function() {
dm <- data.m[sample.int(n, replace = T),]
X <- dm[, 1:3]
y <- dm[, 4]
solve(crossprod(X), crossprod(X,y))
}

3. This comment has been removed by the author.

4. As Jared already suggested: Calling lm() and analogously glm() is unnecessary here because the formula processing needs to be done in every iteration. Using lm.fit() or glm.fit() is much faster. For the logit example:

boot.glm2 <- function() {
dm <- data.m[sample.int(n, replace = TRUE),]
glm.fit(dm[, 1:3], dm[, 4], family = binomial(), control = glm.control())\$coefficients
}

And then you can use it as

set.seed(2)
system.time(rg2 <- replicate(10000, boot.glm2()))

which reduces the computation time by a bit more than 50% on my machine. This is not quite competitive with boot.irls() concerning speed but is likely to be more precise. It uses a relative tolerance of 1e-8 on the deviance/log-likelihood (rather than an absolute tolerance of 1e-6 on the gradient).

5. Thanks for the comments. To sum up results of application of Jared's ideas on my PC:
boot.bare2 gives ~30% speed improvement over boot.bare
boot.qrls gives ~40% speed improvement over boot.bare

1. Note that (on my computer at least) if you increase n, eventually boot.bare2 becomes faster than boot.qrls