The figure below shows the RSS for all possible linear regressions of the prostate cancer data, fitted on the training subset. The standard approach to such a problem in R is to use the leaps package, but I simply wanted to test my own function for generating all subsets of a set.
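
For comparison, the leaps route would look roughly like this (a sketch, assuming the data.set layout loaded below, where the eight predictors sit in columns 2:9 followed by lpsa and the train indicator):

library(leaps)
# Exhaustive best-subset search over the eight predictors,
# training rows only, keeping the best model of each size
best <- regsubsets(lpsa ~ ., data = data.set[data.set$train, 2:10], nvmax = 8)
summary(best)$rss  # RSS of the best model of each size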

Here is the code, with the all.subsets function that generates all subsets, and its application to the prostate cancer data:

library(plyr)
library(ggplot2)

# Generate all 2^n subsets of set via a truth table of inclusion flags
all.subsets <- function(set) {
  n <- length(set)
  bin <- expand.grid(rlply(n, c(F, T)))
  mlply(bin, function(...) { set[c(...)] })
}
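
A quick check of what it returns:

all.subsets(c("a", "b"))
# a list of four subsets: character(0), "a", "b", and c("a", "b")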

file.url <- "http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data"
data.set <- read.table(file.url, sep = "\t", header = TRUE)
varlist <- all.subsets(names(data.set)[2:9])  # the eight predictor names

# Fit lpsa against a given predictor subset on the training observations;
# an empty subset yields the intercept-only model lpsa ~ 1
get.reg <- function(vars) {
  if (length(vars) == 0) { vars <- "1" }
  vars.form <- paste("lpsa ~ ", paste(vars, collapse = " + "))
  lm(vars.form, data = data.set, subset = train)
}

models <- llply(varlist, get.reg)

# RSS and number of coefficients for every fitted model
models.RSS <- ldply(models, function(x) {
  c(RSS = sum(x$residuals ^ 2), k = length(x$coeff))
})

# Best (minimum RSS) model for each subset size k
min.models <- ddply(models.RSS, .(k), function(x) {
  x[which.min(x$RSS), ]
})

qplot(k, RSS, data = models.RSS, ylim = c(0, 100),
      xlab = "Subset Size k", ylab = "Residual Sum-of-Squares") +
  geom_point(data = min.models, aes(x = k, y = RSS), colour = "red") +
  geom_line(data = min.models, aes(x = k, y = RSS), colour = "red") +
  theme_bw()

And here is the plot it generates:

How does the timing of this compare to combn? I've been curious about alternatives to combn, particularly ones that (like anything with plyr) can use multiple cores for functions that take a looooong time when doing all-possible-subsets kinds of things.
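
For reference, a combn-based equivalent and the multicore switch in plyr might look roughly like this (a sketch; all.subsets.combn is a hypothetical name, and .parallel = TRUE needs a registered foreach backend such as doMC):

# Build subsets size by size with combn, prepending the empty subset
all.subsets.combn <- function(set) {
  c(list(set[0]),
    unlist(lapply(seq_along(set),
                  function(k) combn(set, k, simplify = FALSE)),
           recursive = FALSE))
}

# Multicore fitting of the models via plyr's foreach hook
library(doMC)
registerDoMC(cores = 4)
models <- llply(varlist, get.reg, .parallel = TRUE)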

If you care about speed, the following code is much faster for large sets:

all.subsets.fast <- function(set) {
  n <- length(set)
  bin <- vector(mode = "list", length = n)
  for (i in 1L:n) {
    bin[[i]] <- rep.int(c(rep.int(F, 2L ^ (i - 1L)),
                          rep.int(T, 2L ^ (i - 1L))),
                        2L ^ (n - i))
  }
  apply(do.call(cbind, bin), 1L, function(x) { set[x] })
}

However, as you can see, it is more complex.
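
The trick is that column i of bin alternates blocks of FALSE and TRUE of length 2^(i - 1), so the rows of the combined matrix enumerate all 2^n inclusion masks in binary-counting order, with element 1 as the least significant bit. A quick check:

str(all.subsets.fast(c("a", "b", "c")))
# a list of 8 subsets: character(0), "a", "b", c("a", "b"), "c", ...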

Hi, your code is much faster than set_power() in the sets package, but when I try all.subsets.fast(seq(1:50)), it says:

Error: cannot allocate vector of size 4194304.0 Gb. How can I fix it? Thanks!

Such a large structure simply does not fit into memory.

You would have to use a generator rather than a materialized structure. Anyway, 2^50 = 1125899906842624, so you would not be able to iterate over such a large number of elements anyway.
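
To make that concrete, a generator here would decode one inclusion mask at a time from a counter instead of storing all 2^n subsets at once; a minimal sketch (for.each.subset is a hypothetical name, and the integer counter limits it to roughly n < 31):

for.each.subset <- function(set, f) {
  n <- length(set)
  for (j in 0:(2^n - 1)) {
    # bit i of the counter j says whether element i is included
    mask <- as.logical(intToBits(j))[seq_len(n)]
    f(set[mask])
  }
}
for.each.subset(c("a", "b", "c"), print)  # visits all 8 subsets in turn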

Thank you very much! I had attempted to generate all subsets of {1,...,200} to test an assumption of EBIC, but I just found it would take at least 2^200 bytes to store its power set. What should I read to understand your all.subsets.fast() function? What should I read to understand "generator"? I am a grad student in stats and have a BSc in math. Thank you very much!

My function: unfortunately :( I think the best advice is to read the documentation of all the functions I use in the R help.

Generators: https://en.wikipedia.org/wiki/Generator_(computer_programming)

Thank you very much for your help! I will try to understand this function.
