Here is code that produces the required comparison, assuming x ~ U(0, 1) and e ~ N(0, 1), for two example training sample sizes: 20 and 200.
library(rpart)
library(nnet)

# Fit four models on n training points and return, for each model, the total
# squared error against the true regression function 10 * x on a fine grid.
run <- function(n) {
  x <- runif(n)
  y <- 10 * x + rnorm(n)
  new.x <- data.frame(x = seq(0, 1, len = 10000))
  models <- list(linear = lm(y ~ x),
                 tree = rpart(y ~ x),
                 nnet2 = nnet(y ~ x, size = 2,
                              trace = FALSE, linout = TRUE),
                 nnet10 = nnet(y ~ x, size = 10,
                               trace = FALSE, linout = TRUE))
  sapply(models, function(model) {
    pred <- predict(model, newdata = new.x)
    sum((pred - 10 * new.x$x) ^ 2)
  })
}

set.seed(1)
for (n in c(20, 200)) {
  cat("--- n =", n, "---\n")
  print(summary(t(replicate(100, run(n)))))
}
# --- n = 20 ---
#      linear             tree            nnet2              nnet10
#  Min.   :  17.32   Min.   :21046   Min.   :  322.9   Min.   :    566
#  1st Qu.: 247.25   1st Qu.:22562   1st Qu.: 1753.1   1st Qu.:   5759
#  Median : 725.22   Median :24537   Median : 3419.2   Median :  10961
#  Mean   :1071.07   Mean   :25644   Mean   : 7221.4   Mean   :  87200
#  3rd Qu.:1651.43   3rd Qu.:27559   3rd Qu.: 6877.1   3rd Qu.:  22494
#  Max.   :6614.57   Max.   :40742   Max.   :84169.8   Max.   :4309641
# --- n = 200 ---
#      linear             tree           nnet2               nnet10
#  Min.   :  1.107   Min.   :1976   Min.   :   32.62   Min.   :  119.7
#  1st Qu.: 25.939   1st Qu.:2851   1st Qu.:  183.82   1st Qu.:  313.4
#  Median : 76.533   Median :3366   Median :  293.65   Median :  531.5
#  Mean   :112.766   Mean   :3490   Mean   : 2008.36   Mean   : 2211.1
#  3rd Qu.:160.217   3rd Qu.:3921   3rd Qu.:  479.10   3rd Qu.:  742.3
#  Max.   :568.374   Max.   :6502   Max.   :83603.10   Max.   :83444.6

Unsurprisingly, linear regression performs best, since it is the correctly specified model. Judging by the median error, the neural net with size 2 comes next, followed by the neural net with size 10 and finally the regression tree. The neural nets use S-shaped transformations and have effectively more parameters than are needed to fit the relationship. The regression tree, in turn, is simply not well suited to modeling linear relationships between variables.
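To see why the tree struggles, recall that a regression tree predicts a constant within each leaf. The rough sketch below assumes an idealized tree with k equal-width leaves, each predicting the value of 10 * x at the leaf midpoint. The helper name `step_error` is hypothetical, and a fitted rpart tree will not split exactly this way. Even with no noise at all, such a step function leaves an approximation error on the same 10000-point grid, roughly 10^6 / (12 k^2).

```r
# Idealized piecewise-constant approximation of f(x) = 10 * x on [0, 1].
# Assumption: k equal-width leaves, each predicting f at the leaf midpoint,
# evaluated on the same grid that run() uses.
step_error <- function(k, grid = seq(0, 1, len = 10000)) {
  leaf <- pmin(floor(grid * k), k - 1)   # leaf index 0, ..., k - 1
  pred <- 10 * (leaf + 0.5) / k          # value of f at the leaf midpoint
  sum((pred - 10 * grid) ^ 2)
}

# Error floor for trees with 2, 4, 8 and 16 leaves
errors <- sapply(c(2, 4, 8, 16), step_error)
print(round(errors))
```

The error floor for k = 2 is of the same order as the smallest tree errors at n = 20. That would be consistent with rpart's default minsplit = 20 allowing at most one split on 20 observations, although this sketch does not verify it.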
However, neural nets are initialized with random parameters, and sometimes the BFGS optimization fails, producing very poor fits. This shows up as the large values of Max. for nnet2 and nnet10. The median of the results is largely unaffected, but the estimate of the mean error is very unstable because of these outliers (more than 100 replications would be needed for a reliable estimate).
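The instability of the mean can be illustrated with purely synthetic numbers. The sketch below does not rerun nnet. It assumes that a typical error is exponentially distributed with mean 300 and that, with probability 0.02, a fit "fails" and produces an error of 80000. Both the distribution and the constants are assumptions chosen only to resemble the scale of the table above, and `sim_errors` is a hypothetical helper.

```r
# Synthetic illustration (not the actual nnet results): most errors are
# moderate, but occasionally a "failed fit" yields a huge error.
set.seed(1)
sim_errors <- function(reps, p_fail = 0.02) {
  ok <- rexp(reps, rate = 1 / 300)        # typical errors, mean 300 (assumed)
  fail <- rbinom(reps, 1, p_fail) == 1    # rare failed optimizations (assumed)
  ifelse(fail, 80000, ok)
}

# Repeat the whole 100-replication experiment 200 times and record the mean
# and the median of each experiment.
stats <- replicate(200, {
  e <- sim_errors(100)
  c(mean = mean(e), median = median(e))
})

# Spread of each summary across experiments
print(apply(stats, 1, sd))
```

Under these assumptions the mean varies far more from experiment to experiment than the median does, because it depends heavily on how many failed fits happen to occur among the 100 replications.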
Of course, changing the settings of rpart or nnet gives somewhat different results, but the general conclusions will be similar.
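As a minimal sketch of what such a modification might look like, the code below fits a deeper tree and a network with weight decay on a single sample of size 200. The particular values (cp = 0.001, minsplit = 5, decay = 0.01, maxit = 500) are assumptions picked for illustration, not tuned or recommended settings.

```r
library(rpart)
library(nnet)

set.seed(1)
n <- 200
x <- runif(n)
y <- 10 * x + rnorm(n)
new.x <- data.frame(x = seq(0, 1, len = 10000))

models <- list(
  # Assumed settings: allow a deeper tree via smaller cp and minsplit
  tree.deep = rpart(y ~ x,
                    control = rpart.control(cp = 0.001, minsplit = 5)),
  # Assumed settings: weight decay and more iterations for the network
  nnet2.decay = nnet(y ~ x, size = 2, decay = 0.01, maxit = 500,
                     trace = FALSE, linout = TRUE))

# Same error measure as in run(): total squared error against 10 * x
errs <- sapply(models, function(model) {
  pred <- predict(model, newdata = new.x)
  sum((pred - 10 * new.x$x) ^ 2)
})
print(errs)
```

A single run like this says little on its own. To compare fairly with the table above, these variants would need to go through the same 100 replications.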