The figure shows the residual sum-of-squares (RSS) for all possible linear regressions fitted to the prostate cancer data on the training subset. The standard approach to such a problem in R is to use the leaps package, but I simply wanted to test my function that generates all subsets of a set.
Here is the code, with the all.subsets function that generates all subsets, and its application to the prostate cancer data:
library(plyr)
library(ggplot2)
# Generate a list of all subsets of a set: each row of the expand.grid()
# result is one TRUE/FALSE inclusion pattern over the elements of the set.
all.subsets <- function(set) {
  n <- length(set)
  bin <- expand.grid(rlply(n, c(F, T)))
  mlply(bin, function(...) { set[c(...)] })
}
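# For example (a quick sanity check):
#   all.subsets(c("a", "b"))
# returns a list with the four subsets character(0), "a", "b", and c("a", "b").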
file.url <- "http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data"
data.set <- read.table(file.url, sep = "\t", header = TRUE)
# Columns 2:9 hold the eight predictors; lpsa (the response) and train are excluded.
varlist <- all.subsets(names(data.set)[2:9])
# Fit a linear model for lpsa on a given subset of predictors, restricted to
# the training observations; an empty subset gives the intercept-only model.
get.reg <- function(vars) {
  if (length(vars) == 0) {
    vars <- "1"
  }
  vars.form <- paste("lpsa ~", paste(vars, collapse = " + "))
  lm(vars.form, data = data.set, subset = train)
}
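# For example, get.reg(c("lcavol", "lweight")) builds the formula
# "lpsa ~ lcavol + lweight" and fits it on the rows with train == TRUE.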
# Fit all models and record, for each, the residual sum-of-squares and the
# number of coefficients k (intercept included).
models <- llply(varlist, get.reg)
models.RSS <- ldply(models, function(x) {
  c(RSS = sum(x$residuals ^ 2), k = length(x$coeff)) })
# For each subset size k, keep the model with the minimum RSS.
min.models <- ddply(models.RSS, .(k), function(x) {
  x[which.min(x$RSS), ] })
qplot(k, RSS, data = models.RSS, ylim = c(0, 100),
      xlab = "Subset Size k", ylab = "Residual Sum-of-Squares") +
  geom_point(data = min.models, aes(x = k, y = RSS), colour = "red") +
  geom_line(data = min.models, aes(x = k, y = RSS), colour = "red") +
  theme_bw()
And here is the plot it generates:
How does the timing of this compare to combn? I've been curious about alternatives to combn, particularly ones that (like anything with plyr) can use multiple cores for functions that take a long time when doing all-possible-subsets kinds of things.
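For instance, a combn-based version (just a rough sketch on my part, untimed) might build the same list of subsets like this:
# Sketch: enumerate the subsets of each size k with combn() and collect them
# into one list; the empty subset is added separately.
all.subsets.combn <- function(set) {
  c(list(set[0]),
    unlist(lapply(seq_along(set),
                  function(k) combn(set, k, simplify = FALSE)),
           recursive = FALSE))
}
# With a registered parallel backend (foreach plus e.g. doMC), the model fitting
# could then be spread over cores with llply(varlist, get.reg, .parallel = TRUE).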
If you care about speed, the following code is much faster for large sets:
# Build the n binary indicator columns directly: column i alternates blocks of
# FALSE and TRUE of length 2^(i - 1), so the rows run through all 2^n
# inclusion patterns; each row then selects one subset.
all.subsets.fast <- function(set) {
  n <- length(set)
  bin <- vector(mode = "list", length = n)
  for (i in 1L:n) {
    bin[[i]] <- rep.int(c(rep.int(F, 2L ^ (i - 1L)),
                          rep.int(T, 2L ^ (i - 1L))),
                        2L ^ (n - i))
  }
  apply(do.call(cbind, bin), 1L, function(x) { set[x] })
}
However, as you can see, it is more complex.
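For instance, on a three-element set it returns the eight subsets in the same binary-counting order as all.subsets above:
str(all.subsets.fast(c("a", "b", "c")))
# character(0), "a", "b", c("a", "b"), "c", c("a", "c"), c("b", "c"), c("a", "b", "c")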
Hi, your code is much faster than set_power() in the sets package, but when I try all.subsets.fast(seq(1:50)), it says:
Error: cannot allocate vector of size 4194304.0 Gb
How can I fix it? Thanks!
Such a large structure simply does not fit into memory.
You would have to use a generator rather than a materialized structure. Anyway, 2^50 = 1125899906842624, so you would not be able to iterate over such a large number of elements anyway.
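A minimal sketch of what I mean (illustrative code only; subset.by.index is just a made-up name): compute the i-th subset on demand from the bits of i instead of storing all 2^n subsets.
# Map an index i in 0 .. 2^n - 1 to the i-th subset; exact for n <= 53,
# since i is handled as a double.
subset.by.index <- function(set, i) {
  keep <- (i %/% 2 ^ (seq_along(set) - 1)) %% 2 == 1
  set[keep]
}
# e.g. subset.by.index(1:50, 12345) gives one subset without materializing the rest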
Thank you very much! I had attempted to generate all subsets of {1,...,200} to test an assumption of EBIC, but I just found it would take at least 2^200 bytes to store its power set. What should I read to understand your all.subsets.fast() function? What should I read to understand "generator"? I am a grad student in stats and have a BSc in math. Thank you very much!
My function: unfortunately :(, I think the best advice is to read the R help documentation for all the functions I use.
Generators: https://en.wikipedia.org/wiki/Generator_(computer_programming)
Thank you very much for your help! I will try to understand this function.