Saturday, June 14, 2014

RGolf: NGSL Scrabble

This is the last round of RGolf before summer. Since R excels in visualization, today's task will be to generate a plot.

We will work with NGSL data - a list of 2801 important vocabulary words for students of English as a second language. I have prepared the list as an NGSL101.txt file for download.

Let us move to the task. Load NGSL101.txt into R. For each word in the list we want to count the number of words from the list that can be arranged from a subset of the letters of the original word (as in Scrabble). For example, the word "shoot" has the following subwords: "to", "so", "too", "hot", "shoot", "host" and "shot". As the final product we want to plot the relationship between the number of letters in a word and the logarithm of its number of subwords.
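To see the rule in action, here is a small sketch that checks the "shoot" example directly by comparing letter counts (the fits() helper and the toy word vector are my own illustration, not part of the task):

```r
# A word fits into "shoot" scrabble-style if each of its letter
# counts is at most the corresponding count in "shoot".
fits <- function(test, ref) {
  t <- table(strsplit(test, "")[[1]])
  r <- table(strsplit(ref, "")[[1]])
  all(names(t) %in% names(r)) && all(t <= r[names(t)])
}

words <- c("to", "so", "too", "hot", "shoot", "host", "shot", "stone")
words[sapply(words, fits, ref = "shoot")]  # every word except "stone"
```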

The rules of the game are standard:
(1) generate the plot in as few keystrokes as possible;
(2) plot formatting (e.g. x and y axis titles, type of plot) is not important;
(3) one line of code may not be longer than 80 characters;
(4) the solution must be in base R only (no package loading is allowed);
(5) assume that you have NGSL101.txt file in your R working directory.
As always - if you have a nice solution please submit a comment :).

Warning! This time the task takes a bit longer to compute, so it is worth developing and testing the solution on a subset of the NGSL word list.

Here is my attempt in 169 keystrokes:

d=scan("NGSL101.txt","",skip=1)
a=(s=sapply)(s(strsplit(d,""),sort),paste,collapse=".*")
y=log(s(a,function(z)sum(s(a,function(i)grepl(i,z)))))
plot(by(y,nchar(d),mean))

And the output is the following:

As we can see, the number of subwords grows, on average, approximately exponentially with the number of letters in a word.

And here is a verbose version of the solution with comments (warning again - it is slower than the solution given above):

d <- scan("NGSL101.txt", what = "") # load the word list
d <- d[-1] # remove first line from the dataset as it is a comment

is.subword <- function(test, ref) {
# we check if test is a subword of ref by applying regular
# expression on sorted letters contained in both words
test <- paste(sort(strsplit(test, "")[[1]]),collapse=".*")
ref <-  paste(sort(strsplit(ref, "")[[1]]),collapse="")
# grepl returns TRUE if a match is found
grepl(test, ref)
}
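As a standalone illustration of the regex trick inside is.subword(), here is what happens for the concrete pair "host" and "shoot" (the pair is my own example):

```r
# Sorted letters of the candidate, joined by ".*", become the pattern;
# sorted letters of the reference word become the subject string.
test <- paste(sort(strsplit("host", "")[[1]]), collapse = ".*")  # "h.*o.*s.*t"
ref  <- paste(sort(strsplit("shoot", "")[[1]]), collapse = "")   # "hoost"
grepl(test, ref)  # TRUE: each letter of "host" appears, in sorted order, in "hoost"
```

Because both sides are sorted, a subset of letters always appears in the same relative order, which is exactly what the ".*" gaps allow.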

# traverse all words in d and count number of matches
count.subwords <- function(ref) {
sum(sapply(d, is.subword, ref = ref))
}

x2 <- nchar(d)
y2 <- log(sapply(d, count.subwords))
y2.means <- tapply(y2, x2, mean)
plot(y2.means)

1. Hi,

Your use of regular expressions is very clever in this! However, each time you call grepl(), the regexp engine has to "compile" the pattern into some efficient internal representation to parse the input. This is the reason your code takes so much time. You can speed it up: instead of using many different patterns on a single input, it is better to use one pattern on many inputs. Pay attention to the last two lines (164 characters):

d=scan("NGSL101.txt","",skip=1)
a=(s=sapply)(s(strsplit(d,""),sort),paste,collapse=".*")
y=rep(0,length(a));for(z in a)y=y+grepl(z,a)
plot(by(log(y),nchar(d),mean))

Here grepl(z,a) returns the array of flags, each telling whether all letters in the word "z" are in a corresponding word in "a". The loop just accumulates the number of sub-words for each word.
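On a toy set of three sorted-letter patterns (my own encodings of "shoot", "to" and "so"), the accumulation loop behaves like this:

```r
# Each pattern is tested against all words at once; the loop adds up,
# for every word, how many patterns (i.e. potential subwords) match it.
a <- c("h.*o.*o.*s.*t", "o.*t", "o.*s")  # "shoot", "to", "so"
y <- rep(0, length(a))
for (z in a) y <- y + grepl(z, a)
y  # 3 1 1: "shoot" has three subwords here, "to" and "so" only themselves
```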

I'd like to add that in the original code you don't have to create an anonymous function to apply grepl():

y = log( s( a, function( z ) sum( s( a, grepl, z ) ) ) )

And if you're willing to use more memory (and more time) you can do this:

y = log( apply( s( a, grepl, a ), 1, sum ) )

Here s( a, grepl, a ) computes a useful matrix of the "being-a-superword" binary relation among words, where (R,C) is TRUE if C is a sub-word of R.

My own solution is longer (178 characters) and not as fast:

l=strsplit(scan("NGSL101.txt","",skip=1),"")
x=(s=sapply)(s(l,factor,unique(unlist(l))),summary)
y=(a=apply)(x,2,function(v)sum(a(v>=x,2,all)))
plot(by(log(y),s(l,length),mean))

It builds a matrix of the stock of letters in each word (x), and then counts the words from the dictionary that fit into the current word.
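The letter-stock idea can be sketched on a tiny dictionary (my own three-word example; I use table() on the factors rather than summary(), which gives the same letter counts):

```r
# Build a letters-by-words count matrix, then for each word count how
# many columns (words) fit entirely within its stock of letters.
l  <- strsplit(c("shoot", "to", "so"), "")
lv <- unique(unlist(l))
x  <- sapply(lapply(l, factor, levels = lv), table)  # letter counts per word
y  <- apply(x, 2, function(v) sum(apply(x <= v, 2, all)))
y  # 3 1 1: all three words fit inside "shoot"
```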

2. My solution is also based on counting letters, but a bit shorter - 166 characters, and faster. However, it works only for words consisting of letters from the English alphabet (because of "c(letters,LETTERS)").

d=scan("NGSL101.txt","",sk=1)
s=sapply
a=s(s(strsplit(d,""),factor,c(letters,LETTERS)),table)
plot(by(log(apply(a,2,function(w)sum(colSums(a<=w)>51))),nchar(d),mean))

1. Actually my solution can be made alphabet-independent by adding only 3 characters. Now it's almost the same as Ivan's version, but thanks to a few tricks a bit shorter (169 characters).

d=strsplit(scan("NGSL101.txt","",sk=1),"")
s=sapply
a=apply
b=s(s(d,factor,unique(unlist(d))),table)
plot(by(log(a(b,2,function(w)sum(a(b<=w,2,all)))),s(d,length),mean))

2. nice :)
However, for my R configuration "sk=1" does not work and I have to use the full "skip=1".

3. Hmm I've got no problem with "sk=1". Does it throw an error or something?

4. Yes - it is an error, because "scan" has both "skip" and "skipNul" arguments, so the call is ambiguous (at least under R version 3.1.0).

5. Well, that is another reason why you should always use full argument names. Under R 3.0.1 "scan" doesn't have a "skipNul" argument.
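The ambiguity can be reproduced without scan() at all; partial argument matching succeeds only when the abbreviation matches exactly one formal argument (the functions f and g below are my own toy examples):

```r
# "sk" uniquely abbreviates "skip" in f, but is ambiguous in g,
# where two formals ("skip" and "skipNul") share the prefix.
f <- function(skip = 0, other = 1) skip
g <- function(skip = 0, skipNul = FALSE) skip

f(sk = 5)       # 5: unambiguous partial match
try(g(sk = 5))  # error: "sk" matches multiple formal arguments
```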