I compare the stability of logistic regression and of a classification tree on the Participation data set from the Ecdat package. The method works as follows:

- Divide the data into training and test data sets;
- Generate a random subset of the training data and build a logistic regression and a classification tree on it;
- Apply both models to the test data to obtain predicted probabilities;
- Repeat steps 2 and 3 many times;
- For each observation in the test data set, calculate the standard deviation of the obtained predictions for both classes of models;
- For both models, plot a kernel density estimate of the distribution of these standard deviations over the test data set.

```r
library(party)
library(Ecdat)
data(Participation)

set.seed(1)
# Shuffle the data, then split: first 300 rows for testing, the rest for training
shuffle <- Participation[sample(nrow(Participation)), ]
test <- shuffle[1:300, ]
train <- shuffle[301:nrow(Participation), ]

reps <- 1000
p.tree <- p.log <- vector("list", reps)
for (i in 1:reps) {
    # Draw a random subsample of 300 training observations (without replacement)
    train.sub <- train[sample(nrow(train))[1:300], ]
    mtree <- ctree(lfp ~ ., data = train.sub)
    mlog <- glm(lfp ~ ., data = train.sub, family = binomial)
    # Predicted probability of the second class for each test observation
    p.tree[[i]] <- sapply(treeresponse(mtree, newdata = test),
                          function(x) { x[2] })
    p.log[[i]] <- predict(mlog, newdata = test, type = "response")
}

# Per-observation standard deviation of predictions across the replications
plot(density(apply(do.call(rbind, p.log), 2, sd)), main = "", xlab = "sd")
lines(density(apply(do.call(rbind, p.tree), 2, sd)), col = "red")
legend("topright", legend = c("logistic", "tree"),
       col = c("black", "red"), lty = 1)
```

And here is the generated comparison. As can be clearly seen, logistic regression gives much more stable predictions than the classification tree.

A very nice illustration! It'd be interesting to know whether the difference between the two models changes with different parameterizations, i.e., when the models become more or less complex.

Kay:

Great post,

BTW, shouldn't sample have the repeat parameter set to TRUE?

Tal,

I understand you mean this part of the code:

sample(nrow(train))[1:300]

I also checked generating standard bootstrap samples with:

sample(nrow(train), rep = TRUE)

instead of the training data subsamples proposed in the post. The results are similar (but not identical).
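The distinction being discussed can be sketched in a few lines of base R (the sizes here are illustrative, not taken from the post):

```r
set.seed(1)
n <- 20                            # stand-in for nrow(train)

# Subsampling without replacement, as in the post's loop:
# every selected index is unique.
sub <- sample(n)[1:10]

# Bootstrap sampling with replacement: the same index may recur.
boot <- sample(n, replace = TRUE)

any(duplicated(sub))               # always FALSE for a subsample
```

Note that `rep = TRUE` in these comments works only through R's partial matching of the `replace` argument of `sample`.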

Hi Bogumił,

I suspect that if you are interested in a bootstrap sample, then you should go with

sample(nrow(train), rep = TRUE)

so as to allow the same observation to enter more than once. Otherwise, you are dealing with a different allocation of training and testing.

Cheers,

Tal

Tal,

You are right - I would get a bootstrap sample with

sample(nrow(train), rep = TRUE)

However, I did not want to do bootstrapping but to get subsamples of the training data. I know that such an approach reduces the size of the final learning data set, but that is not crucial for illustrating the relative stability of both models.

In my approach one gets 300 unique observations in the learning data set and 300 unique observations in the test data set. I decided to go this way because when one builds a model, normally unique observations are used. Of course the multiple learning data sets generated this way are correlated. The best approach would be to have disjoint learning data sets - but the Participation data set does not contain enough observations for that.
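The correlation between the subsamples mentioned above can be quantified with a quick sketch. Assuming Participation has 872 rows (as documented in Ecdat), holding out 300 for testing leaves 572 for training:

```r
set.seed(1)
n_train <- 572              # rows left for training after holding out 300
a <- sample(n_train)[1:300] # first subsample of 300 unique rows
b <- sample(n_train)[1:300] # second, independently drawn subsample

# Two size-300 subsamples of 572 rows must share at least
# 300 + 300 - 572 = 28 observations; on average roughly
# 300 * 300 / 572 (about 157) are shared.
length(intersect(a, b))
```

So any two of the generated learning data sets overlap substantially, which is why disjoint learning sets would need a much larger data set.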