To test this I decided to generate data following the process y = x1 + ε and compare two linear models. In the first one only the intercept and the parameter for x1 are estimated; in the second a random noise variable x2 is added.
I decided to compare 5-fold Cross-Validation, Adjusted R-squared and a Validation Sample (with a 70/30 split) by testing their ability to select the true model, i.e. the one without the x2 variable. I ran the test for initial sample sizes of 10, 100, 1 000 and 10 000. Here is my code (corrected after Andrej-Nikolai Spiess's comment):
library(boot)
set.seed(1)
decision <- function(n) {
  # simulate data from the true process y = x1 + noise; x2 is pure noise
  x1 <- runif(n)
  x2 <- runif(n)
  y <- x1 + rnorm(n)
  train.size <- round(0.7 * n)
  data.set <- data.frame(x1, x2, y)
  train.data <- data.set[1:train.size, ]
  validation.data <- data.set[(train.size + 1):n, ]
  formulas <- list(y ~ x1, y ~ x1 + x2)
  cv <- arsq <- valid <- list()
  for (i in 1:2) {
    # 5-fold cross-validation estimate of prediction error
    cv[[i]] <- cv.glm(data.set, glm(formulas[[i]]), K = 5)$delta[1]
    # adjusted R-squared on the full sample
    arsq[[i]] <- summary(lm(formulas[[i]]))$adj.r.squared
    # mean squared prediction error on the 30% validation sample
    valid.lm <- lm(formulas[[i]], data = train.data)
    valid[[i]] <- mean((predict(valid.lm, validation.data) -
                        validation.data$y)^2)
  }
  # TRUE when the criterion selects the true model (the one without x2)
  return(c(cv[[1]] < cv[[2]],
           arsq[[1]] > arsq[[2]],
           valid[[1]] < valid[[2]]))
}
correct <- function(n) {
  # proportion of correct selections over 2000 replications
  rowMeans(replicate(2000, decision(n)))
}
n <- c(10, 100, 1000, 10000)
results <- sapply(n, correct)
rownames(results) <- c("CV", "adjR2", "VALID")
colnames(results) <- n
print(results)
The results are given below and are quite interesting:
          10    100   1000  10000
CV    0.7260 0.6430 0.6585 0.6405
adjR2 0.6570 0.6735 0.6575 0.6965
VALID 0.6195 0.6275 0.6220 0.6240
Cross-Validation performance is the best for small samples but deteriorates with sample size. The performance of the Validation Sample approach does not change with sample size and is a bit worse than Cross-Validation. The Adjusted R-squared method is actually better than the other methods for large samples (again, thanks to Andrej-Nikolai Spiess's comment).
What is interesting to me is that increasing the sample size does not lead to certain selection of the true model for the Cross-Validation and Validation Sample methods. Of course, for large samples the estimate of the parameter for variable x2 is near 0, so the mistake is not very costly, but I did not expect such results.
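As a quick check of that last claim, here is a minimal sketch (separate from the simulation above, with illustrative names) that fits the full model once on a large sample; the estimated x2 coefficient typically comes out very close to zero:
set.seed(1)
n <- 10000
x1 <- runif(n); x2 <- runif(n)
y <- x1 + rnorm(n)            # x2 plays no role in the true process
coef(lm(y ~ x1 + x2))["x2"]   # estimate of the redundant coefficient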
Hi Bogumił,
I did some research on model selection methods for sigmoidal models:
http://www.biomedcentral.com/1471-2210/10/6
There I found that the Information Criteria (AIC, corrected AIC, BIC) outperform most of the other methods. This also seems to be the case here. I modified your code a bit (included BIC, and pre-allocated the lists so R doesn't have to 'grow' them in the loop). I also think the comparison should be arsq[[1]] > arsq[[2]], because model 1 should have the higher adjusted R-squared when it is correctly selected. Have a look at how well BIC performs:
library(boot)
set.seed(1)
decision <- function(n) {
  x1 <- runif(n)
  x2 <- runif(n)
  y <- x1 + rnorm(n)
  train.size <- round(0.7 * n)
  data.set <- data.frame(x1, x2, y)
  train.data <- data.set[1:train.size, ]
  validation.data <- data.set[(train.size + 1):n, ]
  formulas <- list(y ~ x1, y ~ x1 + x2)
  # pre-allocate the result lists instead of growing them in the loop
  cv <- arsq <- valid <- bic <- anov <- vector("list", length = 2)
  for (i in 1:2) {
    cv[[i]] <- cv.glm(data.set, glm(formulas[[i]]), K = 5)$delta[1]
    LM <- lm(formulas[[i]])
    anov[[i]] <- LM   # fitted model stored (not used in the return)
    arsq[[i]] <- summary(LM)$adj.r.squared
    bic[[i]] <- BIC(LM)
    valid.lm <- lm(formulas[[i]], data = train.data)
    valid[[i]] <- mean((predict(valid.lm, validation.data) -
                        validation.data$y)^2)
  }
  return(c(cv[[1]] < cv[[2]],
           arsq[[1]] > arsq[[2]],
           valid[[1]] < valid[[2]],
           bic[[1]] < bic[[2]]))
}
correct <- function(n) {
  rowMeans(replicate(200, decision(n)))
}
n <- c(10, 100, 1000)
results <- sapply(n, correct)
rownames(results) <- c("CV", "adjR2", "VALID", "BIC")
colnames(results) <- n
print(results)
         10   100  1000
CV    0.730 0.620 0.600
adjR2 0.660 0.705 0.650
VALID 0.635 0.600 0.595
BIC   0.770 0.950 0.985
Thanks Andrej. I will correct the Adjusted R-squared mistake in the original post.
Adding the following rule:
pval <- summary(lm(formulas[[2]], data = data.set))$coef[3, 4]
pval > 0.05
immediately gives rates of correct model selection close to 1. I recently read the article:
http://www.ecb.int/pub/pdf/scpwps/ecbwp195.pdf, which states that in-sample significance testing outperforms out-of-sample statistics. This example illustrates this point.
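For reference, here is a minimal sketch of how this rule could be plugged into the simulation (the function name decision.pval is made up, only the significance rule is kept, and 0.05 is just the conventional threshold):
decision.pval <- function(n) {
  x1 <- runif(n); x2 <- runif(n)
  y <- x1 + rnorm(n)
  data.set <- data.frame(x1, x2, y)
  # p-value of the x2 coefficient in the full model
  pval <- summary(lm(y ~ x1 + x2, data = data.set))$coef[3, 4]
  pval > 0.05   # TRUE = the true model (without x2) is selected
}
mean(replicate(2000, decision.pval(1000)))   # share of correct selections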
Another possibility is to do a nested F-test to see whether the additional parameter x2 significantly reduces the residual sum-of-squares:
anova(lm(formulas[[1]]), lm(formulas[[2]]))[2, 6]
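The [2, 6] entry is the Pr(>F) value of the model comparison; model 1 is kept when it exceeds the chosen significance level, and with a single added regressor this F-test gives the same p-value as the t-test on the x2 coefficient above. A self-contained sketch (with illustrative names, using the same data-generating process):
set.seed(1)
n <- 1000
x1 <- runif(n); x2 <- runif(n)
y <- x1 + rnorm(n)
p.F <- anova(lm(y ~ x1), lm(y ~ x1 + x2))[2, 6]   # Pr(>F) for adding x2
p.F > 0.05   # TRUE = keep the smaller (true) model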