Friday, May 16, 2014

RGolf

Its time for some fun today - because its Friday as David Smith says :).

There are many code golf sites, even some support R. However, most of them are algorithm oriented. A true RGolf competition should involve transforming a source data frame to some target format data frame.

So the challenge today will be to write a shortest code in R that performs a required data transformation.

Let's start with the data transformation task (actually the problem was taken from a real data set I have recently analyzed).

We are running a survey. Each respondent is asked some subset of possible questions (labelled by letters) and answers the question positively (1) or negatively (0). As input we are given a data frame with two columns: labels of questions asked (as letters) and sequence of answers given to them (string of 0's and 1's). A good R example is better than 1000 words :):

> set.seed(1)
> questions <- replicate(1000, paste(sample(letters[1:10],
      sample.int(4) + 2), collapse = ""))
> answers <- sapply(questions, function(x) {
      paste(as.character(rbinom(nchar(x), 1, 0.5)),
      collapse = "") })
> dataset <- data.frame(questions, answers,
      stringsAsFactors = FALSE)
> head(dataset)
  questions answers
1      cihe    1100
2     gdjie   01100
3    cfbhja  001000
4      febj    1110
5     ehfid   01101
6     hgdic   10010

We will want to transform dataset to the following wide format (stored in dataset2):

> head(dataset2)
   a  b  c  d  e  f  g  h  i  j
1 NA NA  1 NA  0 NA NA  0  1 NA
2 NA NA NA  1  0 NA  0 NA  0  1
3  0  1  0 NA NA  0 NA  0 NA  0
4 NA  1 NA NA  1  1 NA NA NA  0
5 NA NA NA  1  0  1 NA  1  0 NA
6 NA NA  0  0 NA NA  0  1  1 NA

The challenge is to transform dataset in such a way to generate dataset2 in as few keystrokes as possible, assuming that number of questions and number of respondents (respectively equal to 10 and 1000 in example data set) is unknown. The constraints are that one line of code may not be longer than 80 characters and the solution must be in base R only (no package loading is allowed).

Here is my attempt:

d<-dataset;y<-sort(unique(strsplit(paste(d[[1]],collapse=""),"")[[1]]))
d2<-data.frame(t(mapply(function(q,a){r<-rep(NA,length(y))
r[grepl(paste("[",q,"]",sep=""),y)]<-as.numeric(strsplit(a,split="")[[1]][
order(strsplit(q,split="")[[1]])]);names(r)<-y;r},d[[1]],d[[2]],USE.NAMES=F)))

It has 284 characters (including 3 newline characters). If you take the challenge and have a shorter solution that produces exactly the same dataset2 data set for a given input post a comment ;). In order for the comment to be accepted the solution must be robust to changes of generated data set (different number of possible questions and answers).

Before I quit I present the same code in slightly more readable format and commented:

# extract all classes that exist in dataset$questions
# and sort them
classes <- sort(unique(strsplit(paste(dataset$questions,
    collapse = ""), "")[[1]]))

# change one pair of questions and answers into
# a full vector containing all classes sorted
process.qa <- function(q, a) {
    res <- rep(NA, length(classes)) # initially no classes are set
    qs <- strsplit(q, split="")[[1]] # extract question classes
    # extract answers and sort them in order of question classes
    as <- as.numeric(strsplit(a, split="")[[1]][order(qs)])
    # update result with answers for existing questions
    res[grepl(paste("[",q, "]", sep=""), classes)] <- as
    names(res) <- classes
    res
}

dataset2 <- data.frame(t(mapply(process.qa,
    dataset$questions, dataset$answers, USE.NAMES = F)))

11 comments:

  1. 169 with plenty of warnings for being naughty....
    a<-(e<-strsplit)((d<-dataset)$q,'');b<-(h<-lapply)(e(d$a,''),as.integer)
    x<-unique(sort(unlist(a)))
    y<-h(a, `match`, x=x);g<-`names<-`(data.frame(t(mapply('[', b,y))),x)

    > head(g)
    a b c d e f g h i j
    1 NA NA 1 NA 0 NA NA 0 1 NA
    2 NA NA NA 1 0 NA 0 NA 0 1
    3 0 1 0 NA NA 0 NA 0 NA 0
    4 NA 1 NA NA 1 1 NA NA NA 0
    5 NA NA NA 1 0 1 NA 1 0 NA
    6 NA NA 0 0 NA NA 0 1 1 NA
    > head(g)
    a b c d e f g h i j
    1 NA NA 1 NA 0 NA NA 0 1 NA
    2 NA NA NA 1 0 NA 0 NA 0 1
    3 0 1 0 NA NA 0 NA 0 NA 0
    4 NA 1 NA NA 1 1 NA NA NA 0
    5 NA NA NA 1 0 1 NA 1 0 NA
    6 NA NA 0 0 NA NA 0 1 1 NA

    ReplyDelete
  2. 253 with a matrix-multiplication approach. Nice little Friday challenge, thanks for posting!

    d=dataset;names(d)=c('q','a');e=nrow(d) #nchar:39+1
    h=apply(d[1:e,],1,function(r)data.frame(strsplit(r,''),stringsAsFactors=F)) #nchar:75+1
    w=do.call(rbind,h);w$w=floor(as.numeric(rownames(w))) #nchar:53+1
    x=xtabs(as.numeric(a)~w+q,w);y=xtabs(~w+q,w);y[y<1]=NA #nchar:54+1
    d2=as.data.frame.matrix(x*y) #nchar:28

    > head(d2)
    a b c d e f g h i j
    1 NA NA 1 NA 0 NA NA 0 1 NA
    2 NA NA NA 1 0 NA 0 NA 0 1
    3 0 1 0 NA NA 0 NA 0 NA 0
    4 NA 1 NA NA 1 1 NA NA NA 0
    5 NA NA NA 1 0 1 NA 1 0 NA
    6 NA NA 0 0 NA NA 0 1 1 NA

    ReplyDelete
  3. 211 (incl. newlines) :-)

    d=dataset;l=letters;s=substr
    n=which(l==rev(sort(unlist(strsplit(d[,1],""))))[1])
    d2=t(apply(d,1,function(a){b=rep(NA,n);names(b)=l[1:n]
    sapply(1:nchar(a[1]),function(i)b[s(a[1],i,i)]<<-strtoi(s(a[2],i,i)));b}))

    > head(d2)
    a b c d e f g h i j
    [1,] NA NA 1 NA 0 NA NA 0 1 NA
    [2,] NA NA NA 1 0 NA 0 NA 0 1
    [3,] 0 1 0 NA NA 0 NA 0 NA 0
    [4,] NA 1 NA NA 1 1 NA NA NA 0
    [5,] NA NA NA 1 0 1 NA 1 0 NA
    [6,] NA NA 0 0 NA NA 0 1 1 NA

    PS. My solution yields a matrix. If a data frame is required, it'll be 12 characters longer.

    ReplyDelete
  4. 146 characters [and a number of warnings]

    d <- dataset
    d2 <- t(sapply(strsplit(mapply(function(a,b) chartr(a, b, "abcdefghij"), d[,1], d[,2]), ""), as.integer))
    colnames(d2)<-letters[1:10]

    This solution is a matrix, and has useless (but harmless) rownames.

    Another one:

    d2 <- apply(do.call(rbind, strsplit(apply(dataset, 1, function(x) chartr(x[1], x[2], "abcdefghij")), "")), 1:2, as.integer)
    colnames(d2)<-letters[1:10]

    ReplyDelete
  5. 139 (30+26+72+11) including new lines. no warning.

    z=lapply(dataset,strsplit,'')
    l=sort(Reduce(union,z$q))
    d2=data.frame(t(mapply(function(i,j)as.double(j[match(l,i)]),z$q,z$a)))
    names(d2)=l

    The output is identical to Bogumił's, according to identical().

    ReplyDelete
    Replies
    1. You can save 10 characters by replacing the second line by
      l=letters[1:10]

      Delete
    2. Then solution would not be robust to the change of number of possible questions :).
      Anyway - amazing job.

      Delete
  6. 116 I think:

    d=dataset;d3=tapply(u(s(d[[2]],''))=='1',list(rep(1:nrow(d),nchar(d[[1]])),(u=unlist)((s=strsplit)(d[[1]],''))),c)+0

    ReplyDelete
    Replies
    1. Shave off another 3:

      d=dataset;d3=+tapply(u(s(d[[2]],''))==1,list(rep(1:nrow(d),nchar(d[[1]])),(u=unlist)((s=strsplit)(d[[1]],''))),c)

      Delete
    2. Using $ with partial matching is much better at 104:

      d=dataset;d3=+tapply(u(s(d$a,''))==1,list(rep(1:nrow(d),nchar(d$q)),(u=unlist)((s=strsplit)(d$q,''))),c)

      Delete
    3. excellent! one char shorter version ;-)

      d3=+tapply(u(s(d$a,''))==1,list(rep(1:nrow(d<-dataset),nchar(d$q)),(u=unlist)((s=strsplit)(d$q,''))),c)

      Delete

Note: Only a member of this blog may post a comment.