Interestingly, this problem can be solved simply with the basic rep and matrix functions:
f <- factor(sample(c("A", "B", "C"), 8, replace = TRUE))
matrix(as.integer(rep(levels(f), each = length(f)) == f),
nrow = length(f), dimnames = list(f, levels(f)))
The code relies on the fact that R automatically recycles f in the comparison. However, classvec2classmat is faster than the solution proposed here. This is easily checked using system.time. On my computer it is roughly two times faster for a large number of observations.
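To make the recycling step concrete, here is a minimal sketch on a small, fixed factor (chosen for illustration, not taken from the example above). The intermediate names lv and hits are introduced only for this sketch.

```r
# Small fixed factor so the result is easy to follow by hand.
f <- factor(c("B", "A", "C", "A"))

# Each level repeated length(f) times: one block of 4 values per level.
lv <- rep(levels(f), each = length(f))

# lv has length 12 and f has length 4, so f is recycled three times,
# once against each block of identical level labels.
hits <- lv == f

# matrix() fills column by column, so each block becomes one column.
m <- matrix(as.integer(hits), nrow = length(f),
            dimnames = list(f, levels(f)))
print(m)
```

Each row of m contains a single 1, in the column named after that observation's level.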
Both versions are fast enough for practical applications. However, I wanted to understand the reasons for this speed difference, so I checked the classvec2classmat source:
function (yvec) {
yvec <- factor(yvec)
nclasses <- nlevels(yvec)
outmat <- matrix(0, length(yvec), nclasses)
dimnames(outmat) <- list(NULL, levels(yvec))
for (i in 1:nclasses) outmat[which(as.integer(yvec) == i), i] <- 1
outmat
}
The performance gain has two causes:
- my code compares factor labels, not integers (this is easy to fix, but the fix does not fully close the gap);
- classvec2classmat performs assignment only for the indices that need to be set to 1, whereas my code first creates a full-length vector using rep and then transforms it.
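The first point can be tried out with a sketch like the one below. The names cv2cm, rep_chr and rep_int are hypothetical and exist only here: cv2cm is a local copy of the source quoted above, rep_chr is the rep/matrix approach with row names dropped so the outputs are comparable, and rep_int is one possible way to compare integer codes instead of labels. Timings depend on the machine and R version, so no particular result is implied.

```r
# Local copy of the classvec2classmat source quoted above, so the sketch
# does not depend on the package that provides it.
cv2cm <- function(yvec) {
  yvec <- factor(yvec)
  nclasses <- nlevels(yvec)
  outmat <- matrix(0, length(yvec), nclasses)
  dimnames(outmat) <- list(NULL, levels(yvec))
  for (i in 1:nclasses) outmat[which(as.integer(yvec) == i), i] <- 1
  outmat
}

# rep/matrix approach: compares character labels against the factor.
# Row names are set to NULL here (unlike the original) for comparability.
rep_chr <- function(f) {
  matrix(as.integer(rep(levels(f), each = length(f)) == f),
         nrow = length(f), dimnames = list(NULL, levels(f)))
}

# Assumed variant: compare integer codes instead of labels.
rep_int <- function(f) {
  codes <- rep(seq_len(nlevels(f)), each = length(f))
  matrix(as.integer(codes == as.integer(f)),
         nrow = length(f), dimnames = list(NULL, levels(f)))
}

set.seed(1)
f <- factor(sample(c("A", "B", "C"), 1e5, replace = TRUE))

# cv2cm returns a double matrix, so convert it before comparing.
ref <- cv2cm(f)
storage.mode(ref) <- "integer"
same_chr <- identical(rep_chr(f), ref)
same_int <- identical(rep_int(f), ref)

# Time each version; results will differ between machines.
print(system.time(rep_chr(f)))
print(system.time(rep_int(f)))
print(system.time(cv2cm(f)))
```

Both rep-based versions still build a vector of length(f) * nlevels(f) before comparing, which is the second cause listed above and is not addressed by switching to integers.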
Note that you can use matrix indexing to do this fairly succinctly. Using integers also reduces storage space.
> x = factor(sample(letters, 500000, T))
> system.time(m1 <- f1(x))
   user  system elapsed
   0.33    0.09    0.42
> system.time(m1 <- f2(x))
   user  system elapsed
   0.11    0.01    0.12
f2 = function(x) replace(matrix(0L, length(x), nlevels(x), dimnames=list(NULL, levels(x))), cbind(1:length(x), as.integer(x)), 1L)
Really nice solution and over two times faster than classvec2classmat.
I did not know that you can use replace on matrices. It is not obvious from the documentation, as is.vector returns FALSE for matrices. However, I checked the replace source, and it is clear that it works!
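For readers who want to see why this works, here is a small sketch on a fixed four-element factor (an illustrative input, not the one timed above). It assumes, as the replace source suggests, that replace does little more than x[list] <- values, so a two-column index matrix of (row, column) pairs is accepted. The names m0, idx and m2 are introduced only for this sketch.

```r
x <- factor(c("b", "a", "c", "a"))

# Integer zero matrix with one column per level.
m0 <- matrix(0L, length(x), nlevels(x), dimnames = list(NULL, levels(x)))

# A matrix carries a dim attribute, so is.vector() reports FALSE.
print(is.vector(m0))

# Each row of idx is a (row, column) pair: observation i, level code of x[i].
idx <- cbind(seq_along(x), as.integer(x))

# replace() passes idx straight to the assignment, so matrix indexing applies.
m <- replace(m0, idx, 1L)
print(m)

# Direct assignment with the same index matrix, for comparison.
m2 <- m0
m2[idx] <- 1L
print(identical(m, m2))
```

Only length(x) cells are written, which fits the earlier observation that assigning just the needed indices is cheaper than building and comparing a full-length vector.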