R snippets: Generating all subsets of a set

Friday, April 20, 2012


Recently I calculated the Banzhaf power index, which required generating all subsets of a given set. The code given there was a bit complex, so I decided to write a simple function that does it. As an example of its application I reproduce Figure 3.5 from Hastie et al. (2009).
The figure shows the RSS of all possible linear regressions fitted to the training subset of the prostate cancer data. The standard approach for such a problem in R is to use the leaps package, but I simply wanted to test my function for generating all subsets of a set.
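
For comparison, the leaps route mentioned above might look roughly like the sketch below. It is only an illustration under stated assumptions: the leaps package must be installed, and the built-in mtcars data with mpg as the response is a stand-in chosen so the block runs without a download (it is not the prostate data). With nbest = 1, regsubsets reports only the best model of each size, not every subset.

```r
# Assumption: the leaps package is installed; mtcars is stand-in data.
library(leaps)

# Exhaustive search over the 10 predictors of mpg,
# keeping the single best model of each size.
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, nbest = 1,
                  method = "exhaustive")
best <- summary(fit)

# RSS of the best model for each subset size k = 1, ..., 10
best.rss <- best$rss
print(data.frame(k = seq_along(best.rss), RSS = best.rss))

# Logical matrix: which predictors enter the best model of each size
print(best$which)
```

Raising nbest makes regsubsets keep more models per size, which is closer to what the full enumeration in this post does.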

Here is the code, with the all.subsets function generating all subsets, and its application to the prostate cancer data:

library(plyr)
library(ggplot2)

# Return a list of all 2^n subsets of `set`, including the empty one
all.subsets <- function(set) {
    n <- length(set)
    # each row of bin is one FALSE/TRUE inclusion mask of length n
    bin <- expand.grid(rlply(n, c(FALSE, TRUE)))
    mlply(bin, function(...) { set[c(...)] })
}

file.url <- "http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data"
data.set <- read.table(file.url, sep = "\t", header = TRUE)
# columns 2 to 9 are the candidate predictors
varlist <- all.subsets(names(data.set)[2:9])

# Fit lpsa on the given predictors using the training rows only
get.reg <- function(vars) {
    if (length(vars) == 0) {
        vars <- "1"    # empty subset: intercept-only model
    }
    vars.form <- paste("lpsa ~ ", paste(vars, collapse = " + "))
    lm(as.formula(vars.form), data = data.set, subset = train)
}

models <- llply(varlist, get.reg)
# subtract 1 so that k counts predictors and not the intercept
models.RSS <- ldply(models, function(x) {
    c(RSS = sum(residuals(x) ^ 2), k = length(coef(x)) - 1)
})
# best (lowest RSS) model for each subset size
min.models <- ddply(models.RSS, .(k), function(x) {
    x[which.min(x$RSS), ]
})

qplot(k, RSS, data = models.RSS, ylim = c(0, 100),
      xlab = "Subset Size k", ylab = "Residual Sum-of-Squares") +
    geom_point(data = min.models, aes(x = k, y = RSS), colour = "red") +
    geom_line(data = min.models, aes(x = k, y = RSS), colour = "red") +
    theme_bw()

And here is the plot it generates:

7 comments:

jebyrnes, April 23, 2012 at 2:40 PM
How does the time on this compare to combn? I've been curious about alternatives to combn - particularly ones that (like anything with plyr) can use multiple cores for functions that take a looooong time when doing all possible subsets kinds of things.
Bogumił Kamiński, April 24, 2012 at 2:31 PM
If you care about speed the following code is much faster for large sets:

all.subsets.fast <- function(set) {
    n <- length(set)
    bin <- vector(mode = "list", length = n)
    for (i in 1L:n) {
        bin[[i]] <- rep.int(c(rep.int(FALSE, 2L ^ (i - 1L)),
                              rep.int(TRUE, 2L ^ (i - 1L))),
                            2L ^ (n - i))
    }
    apply(do.call(cbind, bin), 1L, function(x) { set[x] } )
}

However, as you can see, it is more complex.
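
One way to look into the combn question raised in this thread is to check the fast version against a combn-based enumeration on a small set and then time both. The sketch below uses base R only. The names all.subsets.combn and canon are hypothetical helpers introduced for this illustration, not part of the post, and the set sizes are arbitrary. It assumes a non-empty character set.

```r
# Fast version from the reply above, repeated so the block is self-contained
all.subsets.fast <- function(set) {
    n <- length(set)
    bin <- vector(mode = "list", length = n)
    for (i in 1L:n) {
        bin[[i]] <- rep.int(c(rep.int(FALSE, 2L ^ (i - 1L)),
                              rep.int(TRUE, 2L ^ (i - 1L))),
                            2L ^ (n - i))
    }
    apply(do.call(cbind, bin), 1L, function(x) { set[x] })
}

# Hypothetical combn-based alternative: build the subsets size by size
# and add the empty subset by hand
all.subsets.combn <- function(set) {
    by.size <- lapply(seq_along(set), function(k) {
        combn(set, k, simplify = FALSE)
    })
    c(list(set[0]), unlist(by.size, recursive = FALSE))
}

# Hypothetical helper: one sorted text key per subset, so that two
# enumerations can be compared regardless of their ordering
canon <- function(subsets) {
    sort(vapply(subsets, paste, character(1), collapse = ","))
}

set <- letters[1:4]
fast <- all.subsets.fast(set)
comb <- all.subsets.combn(set)

print(length(fast) == 2 ^ length(set))      # one subset per mask
print(identical(canon(fast), canon(comb)))  # same subsets either way

# Rough timing on a larger set; the numbers depend on the machine
big <- letters[1:15]
print(system.time(all.subsets.fast(big)))
print(system.time(all.subsets.combn(big)))
```

The two functions return the subsets in different orders (mask order versus size order), which is why the comparison goes through sorted keys.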