Friday, December 2, 2011

Comparing model selection methods

The standard textbook analysis of different model selection methods, like cross-validation or validation sample, focus on their ability to estimate in-sample, conditional or expected test error. However, the other interesting question is to compare them by their ability to select the true model.
To test this I have thought to generate data following the process y = x1 + ε and test two linear models. In the first one only intercept and parameter for x1 are estimated and in the second a random noise variable x2 is added.

I decided to compare 5-fold Cross-Validation, Adjusted R-squared and Validation Sample (with 70/30 split) methods by testing their ability to select the true model - i.e. the one without x2 variable. I run the test assuming initial sample size equal to 10, 100, 1 000 and 10 000. Here goes my code (corrected after Andrej-Nikolai Spiess comment):

library(boot)
set.seed(1)

decision <- function(n) {
x1 <- runif(n)
x2 <- runif(n)
y <- x1 + rnorm(n)
train.size <- round(0.7 * n)
data.set <- data.frame(x1, x2, y)
train.data <- data.set[1 : train.size,]
validation.data <- data.set[(train.size + 1) : n,]
formulas <- list(y ~ x1, y ~ x1 + x2)
cv <- arsq <- valid <- list()
for (i in 1:2) {
cv[[i]] <- cv.glm(data.set, glm(formulas[[i]]),
K = 5)\$delta[1]
valid.lm <- lm(formulas[[i]], data = train.data)
valid[[i]] <- mean((predict(valid.lm, validation.data)
- validation.data\$y)^2)
}
return(c(cv[[1]] < cv[[2]],
arsq[[1]] > arsq[[2]],
valid[[1]] < valid[[2]]))
}

correct <- function(n) {
rowMeans(replicate(2000, decision(n)))
}

n <- c(10, 100, 1000, 10000)
results <- sapply(n, correct)
colnames(results) <- n
print(results)

The results are given below and are quite interesting:

10    100   1000  10000
CV     0.7260 0.6430 0.6585 0.6405
VALID  0.6195 0.6275 0.6220 0.6240

Cross-Validation performance is the best for small samples but deteriorates with sample size. Validation Sample approach performance does not change with sample size and is a bit worse than Cross-Validation. Adjusted R-squared method is actually better than other methods for large samples (again - thanks for Andrej-Nikolai Spiess comment).

What is interesting for me is that increasing sample size does not lead to sure selection of the true model for Cross-Validation and Validation Sample methods. And sure - for large samples the estimate of the parameter for variale x2 is near 0 so the mistake is not very significant, but I did not expect such results.

1. Andrej-Nikolai SpiessDecember 3, 2011 at 5:27 AM

Hi Bogumit,

i made some research on model selection methods for sigmoidal models:

http://www.biomedcentral.com/1471-2210/10/6

There I found that the Information Criteria (AIC, corrected AIC, BIC) outperform most of the other methods. This is also in your case. I modified your code a bit (included BIC, and pre-allocate the list in each loop, so R doesn't have to 'grow' the list). I also think that arsq[[1]] > arsq[[2]] because model 1 should have a higher R-square when correctly selected. Have a look how good BIC performs:

library(boot)
set.seed(1)

decision <- function(n) {
x1 <- runif(n)
x2 <- runif(n)
y <- x1 + rnorm(n)
train.size <- round(0.7 * n)
data.set <- data.frame(x1, x2, y)
train.data <- data.set[1 : train.size,]
validation.data <- data.set[(train.size + 1) : n,]
formulas <- list(y ~ x1, y ~ x1 + x2)
cv <- arsq <- valid <- bic <- anov <- vector("list", length = 2)
for (i in 1:2) {
cv[[i]] <- cv.glm(data.set, glm(formulas[[i]]), K = 5)\$delta[1]
LM <- lm(formulas[[i]])
anov[[i]] <- LM
bic[[i]] <- BIC(LM)
valid.lm <- lm(formulas[[i]], data = train.data)
valid[[i]] <- mean((predict(valid.lm, validation.data) - validation.data\$y)^2)
}
return(c(cv[[1]] < cv[[2]],
arsq[[1]] > arsq[[2]],
valid[[1]] < valid[[2]],
bic[[1]] < bic[[2]]))
}

correct <- function(n) {
rowMeans(replicate(200, decision(n)))
}

n <- c(10, 100, 1000)
results <- sapply(n, correct)
rownames(results) <- c("CV", "adjR2", "VALID", "BIC")
colnames(results) <- n
print(results)

10 100 1000
CV 0.730 0.620 0.600
VALID 0.635 0.600 0.595
BIC 0.770 0.950 0.985

2. Thanks Andrej. And I will correct the original post with Adjusted R-squared mistake.