R snippets: Comparing model selection methods

Friday, December 2, 2011

Comparing model selection methods

The standard textbook analysis of different model selection methods, like cross-validation or validation sample, focus on their ability to estimate in-sample, conditional or expected test error. However, the other interesting question is to compare them by their ability to select the true model.
To test this I have thought to generate data following the process y = x1 + ε and test two linear models. In the first one only intercept and parameter for x1 are estimated and in the second a random noise variable x2 is added.

I decided to compare 5-fold Cross-Validation, Adjusted R-squared and Validation Sample (with 70/30 split) methods by testing their ability to select the true model - i.e. the one without x2 variable. I run the test assuming initial sample size equal to 10, 100, 1 000 and 10 000. Here goes my code (corrected after Andrej-Nikolai Spiess comment):

library(boot)

set.seed(1)

decision <- function(n) {

x1 <- runif(n)

x2 <- runif(n)

y <- x1 + rnorm(n)

train.size <- round(0.7 * n)

data.set <- data.frame(x1, x2, y)

train.data <- data.set[1 : train.size,]

validation.data <- data.set[(train.size + 1) : n,]

formulas <- list(y ~ x1, y ~ x1 + x2)

cv <- arsq <- valid <- list()

for (i in 1:2) {

cv[[i]] <- cv.glm(data.set, glm(formulas[[i]]),
K = 5)$delta[1]

arsq[[i]] <- summary(lm(formulas[[i]]))$adj.r.squared

valid.lm <- lm(formulas[[i]], data = train.data)

valid[[i]] <- mean((predict(valid.lm, validation.data)

- validation.data$y)^2)

}

return(c(cv[[1]] < cv[[2]],

arsq[[1]] > arsq[[2]],

valid[[1]] < valid[[2]]))

}

correct <- function(n) {

rowMeans(replicate(2000, decision(n)))

}

n <- c(10, 100, 1000, 10000)

results <- sapply(n, correct)

rownames(results) <- c("CV", "adjR2", "VALID")

colnames(results) <- n

print(results)

The results are given below and are quite interesting:

10 100 1000 10000

CV 0.7260 0.6430 0.6585 0.6405

adjR2 0.6570 0.6735 0.6575 0.6965

VALID 0.6195 0.6275 0.6220 0.6240

Cross-Validation performance is the best for small samples but deteriorates with sample size. Validation Sample approach performance does not change with sample size and is a bit worse than Cross-Validation. Adjusted R-squared method is actually better than other methods for large samples (again - thanks for Andrej-Nikolai Spiess comment).

What is interesting for me is that increasing sample size does not lead to sure selection of the true model for Cross-Validation and Validation Sample methods. And sure - for large samples the estimate of the parameter for variale x2 is near 0 so the mistake is not very significant, but I did not expect such results.

4 comments:

Andrej-Nikolai SpiessDecember 3, 2011 at 5:27 AM
Hi Bogumit,

i made some research on model selection methods for sigmoidal models:

http://www.biomedcentral.com/1471-2210/10/6

There I found that the Information Criteria (AIC, corrected AIC, BIC) outperform most of the other methods. This is also in your case. I modified your code a bit (included BIC, and pre-allocate the list in each loop, so R doesn't have to 'grow' the list). I also think that arsq[[1]] > arsq[[2]] because model 1 should have a higher R-square when correctly selected. Have a look how good BIC performs:

library(boot)
set.seed(1)

decision <- function(n) {
x1 <- runif(n)
x2 <- runif(n)
y <- x1 + rnorm(n)
train.size <- round(0.7 * n)
data.set <- data.frame(x1, x2, y)
train.data <- data.set[1 : train.size,]
validation.data <- data.set[(train.size + 1) : n,]
formulas <- list(y ~ x1, y ~ x1 + x2)
cv <- arsq <- valid <- bic <- anov <- vector("list", length = 2)
for (i in 1:2) {
cv[[i]] <- cv.glm(data.set, glm(formulas[[i]]), K = 5)$delta[1]
LM <- lm(formulas[[i]])
anov[[i]] <- LM
arsq[[i]] <- summary(LM)$adj.r.squared
bic[[i]] <- BIC(LM)
valid.lm <- lm(formulas[[i]], data = train.data)
valid[[i]] <- mean((predict(valid.lm, validation.data) - validation.data$y)^2)
}
return(c(cv[[1]] < cv[[2]],
arsq[[1]] > arsq[[2]],
valid[[1]] < valid[[2]],
bic[[1]] < bic[[2]]))
}

correct <- function(n) {
rowMeans(replicate(200, decision(n)))
}

n <- c(10, 100, 1000)
results <- sapply(n, correct)
rownames(results) <- c("CV", "adjR2", "VALID", "BIC")
colnames(results) <- n
print(results)

10 100 1000
CV 0.730 0.620 0.600
adjR2 0.660 0.705 0.650
VALID 0.635 0.600 0.595
BIC 0.770 0.950 0.985
ReplyDelete
Replies
Bogumił KamińskiDecember 3, 2011 at 5:52 AM
Thanks Andrej. And I will correct the original post with Adjusted R-squared mistake.
ReplyDelete
Replies
AnonymousDecember 4, 2011 at 11:56 PM
Adding the following rule:

pval<- summary(lm(formulas[[2]],data=data.set))$coef[3,4]

pval>0.05

Imediately gives rates of correct model close to 1. I recently read the article:
http://www.ecb.int/pub/pdf/scpwps/ecbwp195.pdf, which states that in-sample significance testing outperforms out-of-sample statistics. This example illustrates this point.
ReplyDelete
Replies
Andrej-Nikolai SpiessDecember 5, 2011 at 11:32 AM
Another possibility is to do nested F-tests to see if the additional parameter x2 adds significantly in reducing the residual sum-of-squares:

anova(lm(formulas[[1]]), lm(formulas[[2]]))[2, 6]
ReplyDelete
Replies

Add comment

Note: Only a member of this blog may post a comment.