To test this I decided to generate data following the process y = x1 + ε and to compare two linear models. In the first only an intercept and a parameter for x1 are estimated; in the second a pure-noise variable x2 is added.

I compared the 5-fold Cross-Validation, Adjusted R-squared and Validation Sample (with a 70/30 split) methods by testing their ability to select the true model, i.e. the one without the x2 variable. I ran the test for initial sample sizes of 10, 100, 1 000 and 10 000. Here is my code (corrected after Andrej-Nikolai Spiess's comment):

library(boot)
set.seed(1)

decision <- function(n) {
  x1 <- runif(n)
  x2 <- runif(n)
  y <- x1 + rnorm(n)
  train.size <- round(0.7 * n)
  data.set <- data.frame(x1, x2, y)
  train.data <- data.set[1:train.size, ]
  validation.data <- data.set[(train.size + 1):n, ]
  formulas <- list(y ~ x1, y ~ x1 + x2)
  cv <- arsq <- valid <- list()
  for (i in 1:2) {
    cv[[i]] <- cv.glm(data.set, glm(formulas[[i]]), K = 5)$delta[1]
    arsq[[i]] <- summary(lm(formulas[[i]]))$adj.r.squared
    valid.lm <- lm(formulas[[i]], data = train.data)
    valid[[i]] <- mean((predict(valid.lm, validation.data) - validation.data$y)^2)
  }
  return(c(cv[[1]] < cv[[2]],
           arsq[[1]] > arsq[[2]],
           valid[[1]] < valid[[2]]))
}

correct <- function(n) {
  rowMeans(replicate(2000, decision(n)))
}

n <- c(10, 100, 1000, 10000)
results <- sapply(n, correct)
rownames(results) <- c("CV", "adjR2", "VALID")
colnames(results) <- n
print(results)

The results are given below and are quite interesting:

10 100 1000 10000

CV 0.7260 0.6430 0.6585 0.6405

adjR2 0.6570 0.6735 0.6575 0.6965

VALID 0.6195 0.6275 0.6220 0.6240

Cross-Validation performance is the best for small samples but deteriorates with sample size. The Validation Sample approach's performance does not change with sample size and is a bit worse than Cross-Validation. The Adjusted R-squared method is actually better than the other methods for large samples (again, thanks to Andrej-Nikolai Spiess's comment).

What is interesting to me is that increasing the sample size does not lead to certain selection of the true model for the Cross-Validation and Validation Sample methods. Admittedly, for large samples the estimate of the parameter for variable x2 is near 0, so the mistake is not very costly, but I did not expect such results.
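To illustrate this last point, here is a minimal sketch (not part of the experiment above; it reuses the same data-generating process with an arbitrary seed) that fits the larger model on a large sample and inspects the estimate for x2:

set.seed(2)  # arbitrary seed, for illustration only
n <- 10000
x1 <- runif(n)
x2 <- runif(n)
y <- x1 + rnorm(n)
coef(summary(lm(y ~ x1 + x2)))["x2", ]  # the estimate for x2 comes out close to 0

The fitted coefficient of x2 shrinks towards 0 as the sample grows, which is why selecting the wrong model costs little in terms of prediction error.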

Hi Bogumił,

I did some research on model selection methods for sigmoidal models:

http://www.biomedcentral.com/1471-2210/10/6

There I found that the Information Criteria (AIC, corrected AIC, BIC) outperform most of the other methods. This also seems to be the case here. I modified your code a bit (included BIC, and pre-allocated the lists so that R doesn't have to 'grow' them inside the loop). I also think it should be arsq[[1]] > arsq[[2]], because model 1 should have the higher adjusted R-squared when it is correctly selected. Have a look at how well BIC performs:

library(boot)
set.seed(1)

decision <- function(n) {
  x1 <- runif(n)
  x2 <- runif(n)
  y <- x1 + rnorm(n)
  train.size <- round(0.7 * n)
  data.set <- data.frame(x1, x2, y)
  train.data <- data.set[1:train.size, ]
  validation.data <- data.set[(train.size + 1):n, ]
  formulas <- list(y ~ x1, y ~ x1 + x2)
  cv <- arsq <- valid <- bic <- anov <- vector("list", length = 2)
  for (i in 1:2) {
    cv[[i]] <- cv.glm(data.set, glm(formulas[[i]]), K = 5)$delta[1]
    LM <- lm(formulas[[i]])
    anov[[i]] <- LM
    arsq[[i]] <- summary(LM)$adj.r.squared
    bic[[i]] <- BIC(LM)
    valid.lm <- lm(formulas[[i]], data = train.data)
    valid[[i]] <- mean((predict(valid.lm, validation.data) - validation.data$y)^2)
  }
  return(c(cv[[1]] < cv[[2]],
           arsq[[1]] > arsq[[2]],
           valid[[1]] < valid[[2]],
           bic[[1]] < bic[[2]]))
}

correct <- function(n) {
  rowMeans(replicate(200, decision(n)))
}

n <- c(10, 100, 1000)
results <- sapply(n, correct)
rownames(results) <- c("CV", "adjR2", "VALID", "BIC")
colnames(results) <- n
print(results)

10 100 1000

CV 0.730 0.620 0.600

adjR2 0.660 0.705 0.650

VALID 0.635 0.600 0.595

BIC 0.770 0.950 0.985

Thanks Andrej. I will correct the Adjusted R-squared mistake in the original post.

Adding the following rule:

pval <- summary(lm(formulas[[2]], data = data.set))$coef[3, 4]
pval > 0.05

immediately gives rates of correct model selection close to 1. I recently read the article http://www.ecb.int/pub/pdf/scpwps/ecbwp195.pdf, which states that in-sample significance testing outperforms out-of-sample statistics. This example illustrates that point.
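For completeness, here is a sketch of how this rule could be plugged into the decision() function above as a fourth criterion (names follow the earlier code; pval > 0.05 means x2 is judged not significant, i.e. the true model is selected):

# inside decision(), after the loop over the two formulas:
pval <- summary(lm(formulas[[2]], data = data.set))$coef[3, 4]  # p-value of the x2 coefficient
return(c(cv[[1]] < cv[[2]],
         arsq[[1]] > arsq[[2]],
         valid[[1]] < valid[[2]],
         pval > 0.05))  # TRUE when x2 is not significant at the 5% level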

Another possibility is to do a nested F-test to see whether the additional parameter x2 significantly reduces the residual sum of squares:

anova(lm(formulas[[1]]), lm(formulas[[2]]))[2, 6]
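As a usage note (a sketch assuming the formulas list and data.set from the code above are in scope): position [2, 6] of the anova table is the Pr(>F) entry, so the corresponding selection rule would be

f.pval <- anova(lm(formulas[[1]], data = data.set),
                lm(formulas[[2]], data = data.set))[2, 6]  # Pr(>F) of the nested comparison
f.pval > 0.05  # TRUE = x2 does not significantly reduce the RSS, so the true model is chosen

With a single added regressor this F-test is equivalent to the t-test on the x2 coefficient, so it should give practically the same selection rates as the p-value rule above.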