Sunday, January 26, 2014

Tuning optim with parscale

I often get questions what is the use of parscale parameter in optim procedure in GNU R. Therefore I have decided to write a simple example showing its usage and importance.

The function I test is a simplified version of estimation problem I had to solve recently. We have two explanatory variables x1 and x2, The issue is that x1 range is much smaller than x2. In the code below I control it by ratio parameter (set to 200 here). Next we have y = exp(ratio*x1)+x2. This is an explained variable - for simplicity we assume that there is no error in the model.

We take 1000 observations and want to fit the regression by minimizing sum of squared errors. I find the parameters using optim function using Nelder-Mead, BFGS and conjugate gradient methods. In the procedure maxit is set to 200000 to ensure that optimization converges (for default maxit value we would not even get algorithm to converge in some of the cases considered).

The final thing is parscale. We test two values - in the first case we leave it unmodified. In this case all variables are treated equally (although we know that unit change of x1 and x2 has significantly different influence on the objective function). In the second case we hint optim to rescale x1 by ratio. Now unit change of rescaled x1 and x2 have approximately equal influence on the objective function. We want to compare number of function and gradient evaluations in all 6 considered scenarios.

Here is the code that does all the calculations. The printed output is given in the comment at the end of the code:

f <- function(a) { exp(a[1] * x1) + a[2] * x2 }
error <- function(a) {sum((y - f(a)) ^ 2) }

n <- 1000
ratio <- 200
a <- c(ratio, 1)
x1 <- runif(1000) / ratio
x2 <- runif(1000)
y <- f(a)

for (m in c("Nelder-Mead", "BFGS", "CG")) {
    result1 <- optim(rep(1, 2), error, method = m,
        control = list(maxit=200000))
    result2 <- optim(rep(1, 2), error, method = m,
        control = list(maxit=200000, parscale = a))
    cat(format(m, justify="left", width=14), names(result1$count),
        "\n  No parscale:", format(result1$counts, width=8),
        "\n  Parscale:   ", format(result2$counts, width=8),
# Nelder-Mead    function gradient
#   No parscale:    17703       NA
#   Parscale:          63       NA

# BFGS           function gradient
#   No parscale:       90       42
#   Parscale:          65       16

# CG             function gradient
#   No parscale:   448479   112123
#   Parscale:         341       69

It can be clearly seen that appropriate rescaling of parameters has very significant influence on optimization converegence speed. It is especially important in Nelder-Mead and conjugate gradient methods in our example.


  1. Very important topic you are touching there! Since in your example you are doing nonlinear regression on an exponential + linear function by minimizing residual sum-of-squares, one must know that the standard 'nls' function offers no way of parameter scaling. However, the fabulous minpack.lm (also function nlsLM) offers a 'diag' parameter for parameter scaling in the 'control.nls.lm' function. But even without setting parameter scales, nlsLM converges in 6 (!) iterations:

    nlsLM(y ~ exp(a * x1) + b * x2, start = list(a = 1, b = 2))
    Nonlinear regression model
    model: y ~ exp(a * x1) + b * x2
    data: parent.frame()
    a b
    200 1
    residual sum-of-squares: 0

    Number of iterations to convergence: 6
    Achieved convergence tolerance: 1.49e-08

    showing that the Levenberg-Marquardt method might be better in these kind of scenarios.


    1. The reason why I used optim is exactly because nls has convergence problems. In fact for maxiter = 1000000 nls does not converge.

      However, I did not know minpack.lm - it works much better :).

    2. It really is so much better than nls (rarely convergence problems)!
      See my post here: