Friday, December 2, 2011

Comparing model selection methods

The standard textbook analysis of different model selection methods, like cross-validation or validation sample, focus on their ability to estimate in-sample, conditional or expected test error. However, the other interesting question is to compare them by their ability to select the true model.
To test this I have thought to generate data following the process y = x1 + ε and test two linear models. In the first one only intercept and parameter for x1 are estimated and in the second a random noise variable x2 is added.

I decided to compare 5-fold Cross-Validation, Adjusted R-squared and Validation Sample (with 70/30 split) methods by testing their ability to select the true model - i.e. the one without x2 variable. I run the test assuming initial sample size equal to 10, 100, 1 000 and 10 000. Here goes my code (corrected after Andrej-Nikolai Spiess comment):

library(boot)
set.seed(1)

decision <- function(n) {
      x1 <- runif(n)
      x2 <- runif(n)
      y <- x1 + rnorm(n)
      train.size <- round(0.7 * n)
      data.set <- data.frame(x1, x2, y)
      train.data <- data.set[1 : train.size,]
      validation.data <- data.set[(train.size + 1) : n,]
      formulas <- list(y ~ x1, y ~ x1 + x2)
      cv <- arsq <- valid <- list()
      for (i in 1:2) {
            cv[[i]] <- cv.glm(data.set, glm(formulas[[i]]),
                              K = 5)$delta[1]
            arsq[[i]] <- summary(lm(formulas[[i]]))$adj.r.squared
            valid.lm <- lm(formulas[[i]], data = train.data)
            valid[[i]] <- mean((predict(valid.lm, validation.data)
                           - validation.data$y)^2)
      }
      return(c(cv[[1]] < cv[[2]],
               arsq[[1]] > arsq[[2]],
               valid[[1]] < valid[[2]]))
}

correct <- function(n) {
      rowMeans(replicate(2000, decision(n)))
}

n <- c(10, 100, 1000, 10000)
results <- sapply(n, correct)
rownames(results) <- c("CV", "adjR2", "VALID")
colnames(results) <- n
print(results)

The results are given below and are quite interesting:

           10    100   1000  10000
CV     0.7260 0.6430 0.6585 0.6405
adjR2  0.6570 0.6735 0.6575 0.6965
VALID  0.6195 0.6275 0.6220 0.6240

Cross-Validation performance is the best for small samples but deteriorates with sample size. Validation Sample approach performance does not change with sample size and is a bit worse than Cross-Validation. Adjusted R-squared method is actually better than other methods for large samples (again - thanks for Andrej-Nikolai Spiess comment).

What is interesting for me is that increasing sample size does not lead to sure selection of the true model for Cross-Validation and Validation Sample methods. And sure - for large samples the estimate of the parameter for variale x2 is near 0 so the mistake is not very significant, but I did not expect such results.

4 comments:

  1. Andrej-Nikolai SpiessDecember 3, 2011 at 5:27 AM

    Hi Bogumit,

    i made some research on model selection methods for sigmoidal models:

    http://www.biomedcentral.com/1471-2210/10/6

    There I found that the Information Criteria (AIC, corrected AIC, BIC) outperform most of the other methods. This is also in your case. I modified your code a bit (included BIC, and pre-allocate the list in each loop, so R doesn't have to 'grow' the list). I also think that arsq[[1]] > arsq[[2]] because model 1 should have a higher R-square when correctly selected. Have a look how good BIC performs:

    library(boot)
    set.seed(1)

    decision <- function(n) {
    x1 <- runif(n)
    x2 <- runif(n)
    y <- x1 + rnorm(n)
    train.size <- round(0.7 * n)
    data.set <- data.frame(x1, x2, y)
    train.data <- data.set[1 : train.size,]
    validation.data <- data.set[(train.size + 1) : n,]
    formulas <- list(y ~ x1, y ~ x1 + x2)
    cv <- arsq <- valid <- bic <- anov <- vector("list", length = 2)
    for (i in 1:2) {
    cv[[i]] <- cv.glm(data.set, glm(formulas[[i]]), K = 5)$delta[1]
    LM <- lm(formulas[[i]])
    anov[[i]] <- LM
    arsq[[i]] <- summary(LM)$adj.r.squared
    bic[[i]] <- BIC(LM)
    valid.lm <- lm(formulas[[i]], data = train.data)
    valid[[i]] <- mean((predict(valid.lm, validation.data) - validation.data$y)^2)
    }
    return(c(cv[[1]] < cv[[2]],
    arsq[[1]] > arsq[[2]],
    valid[[1]] < valid[[2]],
    bic[[1]] < bic[[2]]))
    }

    correct <- function(n) {
    rowMeans(replicate(200, decision(n)))
    }

    n <- c(10, 100, 1000)
    results <- sapply(n, correct)
    rownames(results) <- c("CV", "adjR2", "VALID", "BIC")
    colnames(results) <- n
    print(results)

    10 100 1000
    CV 0.730 0.620 0.600
    adjR2 0.660 0.705 0.650
    VALID 0.635 0.600 0.595
    BIC 0.770 0.950 0.985

    ReplyDelete
  2. Thanks Andrej. And I will correct the original post with Adjusted R-squared mistake.

    ReplyDelete
  3. Adding the following rule:

    pval<- summary(lm(formulas[[2]],data=data.set))$coef[3,4]

    pval>0.05

    Imediately gives rates of correct model close to 1. I recently read the article:
    http://www.ecb.int/pub/pdf/scpwps/ecbwp195.pdf, which states that in-sample significance testing outperforms out-of-sample statistics. This example illustrates this point.

    ReplyDelete
  4. Andrej-Nikolai SpiessDecember 5, 2011 at 11:32 AM

    Another possibility is to do nested F-tests to see if the additional parameter x2 adds significantly in reducing the residual sum-of-squares:

    anova(lm(formulas[[1]]), lm(formulas[[2]]))[2, 6]

    ReplyDelete