Monday, May 20, 2013

Model fitting exam problem

Recently I ran an exam in which the following question caused many problems for students (I give a shortened formulation here). You are given the data generating process y = 10x + e, where e is an error term. Fit a linear regression using lm, neural nets using nnet with size equal to 2 and 10, and a regression tree using rpart. What can be said about the distribution of the prediction error of these four modeling techniques?

Here is the code that generates the required comparison, assuming that x ~ U(0, 1) and e ~ N(0, 1), for two example training sample sizes, 20 and 200.

library(rpart)
library(nnet)

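run <- function(n) {
    # Training sample: x ~ U(0, 1), y = 10x + e with e ~ N(0, 1)
    x <- runif(n)
    y <- 10 * x + rnorm(n)
    # Dense grid on [0, 1] used to evaluate the fitted models
    new.x <- data.frame(x = seq(0, 1, length.out = 10000))
    models <- list(linear = lm(y ~ x),
                   tree   = rpart(y ~ x),
                   nnet2  = nnet(y ~ x, size = 2,
                                 trace = FALSE, linout = TRUE),
                   nnet10 = nnet(y ~ x, size = 10,
                                 trace = FALSE, linout = TRUE))
    # Sum of squared deviations of predictions from the true mean 10x
    sapply(models, function(model) {
        pred <- predict(model, newdata = new.x)
        sum((pred - 10 * new.x$x) ^ 2)
    })
}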

set.seed(1)
for (n in c(20, 200)) {
    cat("--- n =", n, "---\n")
    print(summary(t(replicate(100, run(n)))))
}

# --- n = 20 ---
#      linear             tree           nnet2             nnet10    
#  Min.   :  17.32   Min.   :21046   Min.   :  322.9   Min.   :    566
#  1st Qu.: 247.25   1st Qu.:22562   1st Qu.: 1753.1   1st Qu.:   5759
#  Median : 725.22   Median :24537   Median : 3419.2   Median :  10961
#  Mean   :1071.07   Mean   :25644   Mean   : 7221.4   Mean   :  87200
#  3rd Qu.:1651.43   3rd Qu.:27559   3rd Qu.: 6877.1   3rd Qu.:  22494
#  Max.   :6614.57   Max.   :40742   Max.   :84169.8   Max.   :4309641
# --- n = 200 ---
#      linear             tree          nnet2              nnet10    
#  Min.   :  1.107   Min.   :1976   Min.   :   32.62   Min.   :  119.7
#  1st Qu.: 25.939   1st Qu.:2851   1st Qu.:  183.82   1st Qu.:  313.4
#  Median : 76.533   Median :3366   Median :  293.65   Median :  531.5
#  Mean   :112.766   Mean   :3490   Mean   : 2008.36   Mean   : 2211.1
#  3rd Qu.:160.217   3rd Qu.:3921   3rd Qu.:  479.10   3rd Qu.:  742.3
#  Max.   :568.374   Max.   :6502   Max.   :83603.10   Max.   :83444.6

It is clear that linear regression is optimal, as it is correctly specified. Judged by the median error, it is followed by the neural net with size 2, then the neural net with size 10, and last the regression tree. The reason is that neural nets use S-shaped transformations and effectively have more parameters than are needed to fit the relationship. Finally, the regression tree is simply not well suited to modeling linear relationships between variables, because it can only approximate the line with a step function.
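The two arguments above can be checked directly. The following is only a minimal sketch under the same assumptions as the main code (x ~ U(0, 1), e ~ N(0, 1)); the names n.par, grid and tree.levels are introduced here for illustration. It counts the fitted parameters of each model and the number of distinct values the tree predicts on a fine grid.

```r
library(rpart)
library(nnet)

set.seed(1)
x <- runif(200)
y <- 10 * x + rnorm(200)

# Number of fitted parameters: 2 for lm, and 3 * size + 1 for a
# one-input, one-output nnet (hidden and output weights plus biases)
n.par <- c(linear = length(coef(lm(y ~ x))),
           nnet2  = length(nnet(y ~ x, size = 2,
                                trace = FALSE, linout = TRUE)$wts),
           nnet10 = length(nnet(y ~ x, size = 10,
                                trace = FALSE, linout = TRUE)$wts))
print(n.par)

# A regression tree predicts one constant per leaf, so its predictions
# over a fine grid take only a handful of distinct values
grid <- data.frame(x = seq(0, 1, length.out = 1000))
tree.levels <- length(unique(predict(rpart(y ~ x), newdata = grid)))
print(tree.levels)
```

The true relationship needs only two parameters (intercept and slope), so every extra weight in the networks is free to fit noise, and the small number of tree levels shows how coarse the step-function approximation is.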

However, neural nets are initialized with random parameters, and sometimes the BFGS optimization fails and produces a very poor fit. This shows up as the large Max. values for nnet2 and nnet10. The median of the results is largely unaffected by this, but the estimate of the mean error is very unstable because of the outliers (for n = 20 the mean for nnet10 is 87200 against a median of 10961). More than 100 replications are needed to get a reliable estimate of the mean.
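One common way to reduce the effect of bad random starts, which is not part of the comparison above, is to fit the network several times and keep the fit with the lowest value of the training criterion. The sketch below assumes this approach; best.nnet and its restarts argument are hypothetical names introduced for illustration.

```r
library(nnet)

# Hypothetical helper: fit nnet from several random starts and keep
# the fit with the lowest value of the fitting criterion
best.nnet <- function(formula, data, size, restarts = 5) {
    fits <- lapply(seq_len(restarts), function(i) {
        nnet(formula, data = data, size = size,
             trace = FALSE, linout = TRUE)
    })
    values <- sapply(fits, function(fit) fit$value)
    fits[[which.min(values)]]
}

set.seed(1)
d <- data.frame(x = runif(20))
d$y <- 10 * d$x + rnorm(20)
fit <- best.nnet(y ~ x, data = d, size = 2)
print(fit$value)
```

Note that this selects on the training criterion only, so it guards against failed optimization but does nothing about overfitting.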

Of course, changing the rpart or nnet settings gives somewhat different results, but the general conclusions will be similar.
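As an illustration of such modifications, here is a minimal sketch that grows a deeper tree through rpart.control and adds weight decay and more iterations to nnet. The particular values (cp = 0.001, minsplit = 5, decay = 0.01, maxit = 1000) are assumptions chosen only for the example and are not tuned.

```r
library(rpart)
library(nnet)

set.seed(1)
x <- runif(200)
y <- 10 * x + rnorm(200)
new.x <- data.frame(x = seq(0, 1, length.out = 10000))

# Assumed settings, chosen only for illustration
models <- list(
    tree.deep  = rpart(y ~ x,
                       control = rpart.control(cp = 0.001, minsplit = 5)),
    nnet.decay = nnet(y ~ x, size = 10, decay = 0.01, maxit = 1000,
                      trace = FALSE, linout = TRUE))

# Same error measure as in run(): squared deviation from the true mean 10x
errors <- sapply(models, function(model) {
    pred <- predict(model, newdata = new.x)
    sum((pred - 10 * new.x$x) ^ 2)
})
print(errors)
```

To compare these variants properly they would have to be added to the models list inside run() and replicated in the same way as the original four.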
