Friday, December 9, 2011

Stability of classification trees

Classification trees are known to be unstable with respect to training data. Recently I read an article on the stability of classification trees by Briand et al. (2009). They propose a quantitative similarity measure between two trees. The method is interesting, and it inspired me to prepare a simple example, based on test data, showing the instability of classification trees.

I compare the stability of logistic regression and a classification tree on the Participation data set from the Ecdat package. The method works as follows:
  1. Divide the data into a training and a test data set;
  2. Generate a random subset of the training data and fit a logistic regression and a classification tree on it;
  3. Apply both models to the test data to obtain predicted probabilities;
  4. Repeat steps 2 and 3 many times;
  5. For each observation in the test data set, calculate the standard deviation of the predictions obtained from each model;
  6. For both models, plot a kernel density estimate of the distribution of these standard deviations over the test set.
The code performing the above steps is as follows:

library(Ecdat)   # provides the Participation data set
library(party)   # provides ctree() and treeresponse()

shuffle <- Participation[sample(nrow(Participation)),]
test <- shuffle[1:300,]
train <- shuffle[301:nrow(Participation),]
reps <- 1000
p.tree <- p.log <- vector("list", reps)

for (i in 1:reps) {
      train.sub <- train[sample(nrow(train))[1:300],]
      mtree <- ctree(lfp ~ ., data = train.sub)
      mlog <- glm(lfp ~ ., data = train.sub, family = binomial)
      # probability of the second class of lfp for each test observation
      p.tree[[i]] <- sapply(treeresponse(mtree, newdata = test),
                            function(x) { x[2] })
      p.log[[i]] <- predict(mlog, newdata = test, type = "response")
}

# stack the replicated predictions into reps x n(test) matrices
# and compute the per-observation standard deviation
plot(density(apply(do.call(rbind, p.log), 2, sd)),
     main = "", xlab = "sd")
lines(density(apply(do.call(rbind, p.tree), 2, sd)), col = "red")
legend("topright", legend = c("logistic", "tree"),
       col = c("black", "red"), lty = 1)

And here is the generated comparison. As can be clearly seen, logistic regression gives much more stable predictions than the classification tree.
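If a numerical summary is preferred to the density plot, the same per-observation standard deviations can be tabulated directly. A minimal sketch, assuming `p.log` and `p.tree` hold the lists filled in the loop above:

```r
# Stack the replicated prediction vectors into reps x n(test) matrices
log.mat  <- do.call(rbind, p.log)
tree.mat <- do.call(rbind, p.tree)

# Standard deviation of predictions for each test observation
sd.log  <- apply(log.mat, 2, sd)
sd.tree <- apply(tree.mat, 2, sd)

# Five-number summaries of prediction variability for each model
summary(sd.log)
summary(sd.tree)
```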


  1. A very nice illustration! It would be interesting to know whether the difference between the two models changes with different parameterizations of the models, i.e., when the models become more or less complex.


  2. Great post,
    BTW, shouldn't sample have the replace parameter set to TRUE?

  3. Tal,

    I understand you mean this part of the code:

    I also checked generating standard bootstrap samples with:
    sample(nrow(train), rep = TRUE)
    instead of subsamples of the training data (as proposed in the post). The results are similar (but not identical).

  4. Hi Bogumił,
    I suspect that if you are interested in a bootstrap sample, then you should go with -
    sample(nrow(train), rep = TRUE)
    so as to allow the same observation to enter more than once. Otherwise, you are dealing with a different allocation of training and testing.


  5. Tal,

    You are right - I would get a bootstrap sample with sample(nrow(train), rep = TRUE).

    However, I did not want to do bootstrapping but to take subsamples of the training data. I know that such an approach reduces the final learning data set size, but that is not crucial for illustrating the relative stability of the two models.

    In my approach one gets 300 unique observations in the learning data set and 300 unique observations in the test data set. I decided to go this way because when one builds a model, normally unique observations are used. Of course, the multiple learning data sets generated are correlated. The best approach would be to have disjoint learning data sets, but the Participation data set does not contain enough observations for that.
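    The difference between the two sampling schemes discussed above is easy to see directly. A minimal sketch (the index vector stands in for the row numbers of the training data; the seed is only for reproducibility of the illustration):

```r
set.seed(42)
idx <- 1:572  # stand-in for the row numbers of the training data

sub  <- sample(idx)[1:300]          # subsampling: 300 distinct rows
boot <- sample(idx, replace = TRUE) # bootstrap: 572 draws, repeats allowed

anyDuplicated(sub)   # 0: every subsampled row is unique
anyDuplicated(boot)  # positive: some rows are drawn more than once
```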