Friday, December 16, 2011

Optimal regularization for smoothing splines

In the smooth.spline procedure one can use either the df or the spar parameter to control the smoothing level. Usually they are not set manually, but recently I was asked which of them is a better measure of regularization level. Hastie et al. (2009) discuss the advantages of df, but I thought of a simple graphical illustration of this fact.
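
As a minimal sketch of the two parametrizations, the same data can be fitted once with df fixed and once with spar fixed. The sample size of 200 and the values df = 8 and spar = 0.8 are arbitrary and chosen only for illustration; they are not taken from the comparison itself.

```r
set.seed(1)
# Illustrative data; n = 200 is an arbitrary choice
x <- runif(200, -2, 2)
y <- x ^ 2 / 2 + sin(4 * x) + rnorm(200)

# Fix the smoothing level through equivalent degrees of freedom
fit.df <- smooth.spline(x, y, df = 8)
# Fix the smoothing level through the spar parameter
fit.spar <- smooth.spline(x, y, spar = 0.8)

# Each fitted object reports df, spar and the penalty lambda,
# so a value given in one parametrization can be read off in the other
print(c(df = fit.df$df, spar = fit.df$spar, lambda = fit.df$lambda))
print(c(df = fit.spar$df, spar = fit.spar$spar, lambda = fit.spar$lambda))
```

Whichever parameter is supplied, the fitted object carries the corresponding values of the other one and of lambda, which makes it possible to translate between the two scales for a given data set.
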
I use the following criterion to judge whether a quantity measures the roughness penalty well: increasing the training sample size should influence the optimal value of the roughness penalty level in a monotonic way.
I compared the performance of the df and spar parametrizations using a sample function (defined in the gen.data function) for different training sample sizes. Here is the code for the df parametrization:

set.seed(1)
gen.data <- function(n) {
      x <- runif(n, -2, 2)
      y <- x ^ 2 / 2 + sin(4 * x) + rnorm(n)
      return(data.frame(x, y))
}

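# Grid of df values to evaluate
df.levels <- seq(5, 15, length.out = 100)
# Training sample sizes: 64, 96, 144, 216, 324, 486
n.train <- (3 ^ (0 : 5)) * (2 ^ (6 : 1))
cols <- 1      # color index, incremented for each training size
reps <- 100    # number of simulated training sets per sample size
# Large validation set used to estimate prediction error
valid <- gen.data(100000)
# Empty plot; xlim extends past 15 to leave room for the "n =" labels
plot(NULL, xlab = "df", ylab = "mse",
     xlim = c(5, 18), ylim = c(1, 1.3))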

for (n in n.train) {
      # Accumulated validation MSE for each df level
      mse <- rep(0, length(df.levels))
      for (j in 1 : reps) {
            # Draw a fresh training sample of size n
            train <- gen.data(n)
            for (i in seq(along.with = df.levels)) {
                  ss <- smooth.spline(train, df = df.levels[i])
                  ss.y <- predict(ss, valid$x)$y
                  mse[i] <- mse[i] + mean((ss.y - valid$y) ^ 2)
            }
      }
      # Average over the repetitions
      mse <- mse / reps
      lines(df.levels, mse, col = cols, lwd = 2)
      # Mark the df level with the lowest average MSE
      points(df.levels[which.min(mse)], min(mse),
             col = cols, pch = 19)
      # Label the curve with its training sample size
      text(15, mse[length(mse)], paste("n =", n),
           col = cols, pos = 4)
      cols <- cols + 1
}

It produces the following result:


The plot shows the desired property: the optimal df changes monotonically with the training sample size. A similar plot can be obtained for the spar parameter by a simple modification of the code:



It is easy to notice that the optimal values of spar do not change in a monotonic way as the number of observations increases.
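
The modification for spar is not listed in the post, so the following is only a sketch of what it might look like: the df argument of smooth.spline is replaced by spar and the grid is changed accordingly. The grid spar.levels, the helper spar.mse and the reduced values of reps and of the validation set size are assumptions made here so that the example runs quickly; they are not the settings behind the plot.

```r
set.seed(1)
gen.data <- function(n) {
      x <- runif(n, -2, 2)
      y <- x ^ 2 / 2 + sin(4 * x) + rnorm(n)
      return(data.frame(x, y))
}

# Assumed grid of spar values (the range used for the plot is not given)
spar.levels <- seq(0.5, 1.2, length.out = 15)
reps <- 5                  # reduced from 100 to keep the sketch fast
valid <- gen.data(2000)    # reduced from 100000 for the same reason

# Hypothetical helper: average validation MSE for each spar level
# at training sample size n
spar.mse <- function(n) {
      mse <- rep(0, length(spar.levels))
      for (j in 1 : reps) {
            train <- gen.data(n)
            for (i in seq(along.with = spar.levels)) {
                  # The only substantive change: spar instead of df
                  ss <- smooth.spline(train, spar = spar.levels[i])
                  ss.y <- predict(ss, valid$x)$y
                  mse[i] <- mse[i] + mean((ss.y - valid$y) ^ 2)
            }
      }
      mse / reps
}

mse <- spar.mse(96)
# spar level with the lowest average validation MSE for n = 96
best.spar <- spar.levels[which.min(mse)]
print(best.spar)
```

Calling spar.mse for each value in n.train and drawing the curves with lines, points and text as in the df code would give the corresponding plot; with such small reps and validation set the curves are noisy, so the original settings are needed for a meaningful comparison.
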

This comparison shows that df is a better measure of regularization level than spar.

Additionally, one can notice that the curves for different training sample sizes intersect for the spar parametrization, which is unexpected. It might be due only to the randomness of the data generation process, but I have run the simulation several times and the curves always intersected. Unfortunately, I do not have a proof of what should happen when the size of the valid data set and the reps parameter both tend to infinity.
