Wednesday, June 25, 2014

R Scrabble: Part 2

Ivan Nazarov and Bartek Chroł posted very interesting comments on my last post about counting the number of subwords in NGSL words. In particular, they proposed large speedups of my code, so I thought I would try a larger data set. Today I will work with TWL2006 - the official word authority for tournament Scrabble in the USA.
The question is whether the exponential relationship between word length and the number of subwords, observed in the NGSL data set, still holds for TWL2006.
The challenge is that NGSL has 2801 words, while TWL2006 is much larger with 178691 words. You can download the file TWL2006.txt, which contains the words and was prepared (converted to lowercase and sorted by word length) using data from the Internet Scrabble Club website.
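
A quick sanity check in R (assuming TWL2006.txt sits in the working directory) confirms the word count and that the file is indeed sorted by word length:

words <- readLines("TWL2006.txt")
length(words)               # should print 178691
!is.unsorted(nchar(words))  # should print TRUE: sorted by length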

You could run the code from the comments to my last post to obtain the results, but it takes ages to compute (over 1 hour). Therefore I have written the data preparation step in Java, which reduces the time needed to perform the analysis to around 1 minute.

So first let us start with the Java code that computes the number of subwords for each word in the TWL2006 dictionary:

package scrabble;

import java.io.*;

public class Scrabble implements Runnable {

    // b[i][k] counts occurrences of letter ('a' + k) in word i
    public static int[][] b = new int[178691][26];
    // wlen[i] stores the length of word i
    public static int[] wlen = new int[178691];
    // Set number of threads to match your hardware
    public static final int MAX_THREADS = 6;

    public static void main(String[] args)
            throws FileNotFoundException, IOException {
        String file = "TWL2006.txt";
        int i;
        
        try (BufferedReader r =
             new BufferedReader(new FileReader(file))) {
            i = 0;
            for (String line; (line = r.readLine()) != null; i++) {
                for (char c : line.toCharArray()) { b[i][c - 'a']++; }
                wlen[i] = line.length();
            }
        }
        for (i = 0; i < MAX_THREADS; i++) {
            new Thread(new Scrabble(i)).start();
        }
    }
    
    private final int counter;
            
    public Scrabble(int counter) {
        this.counter = counter;
    }
    
    @Override
    public void run() {
        String filename = "result_" + counter + ".txt";
        try (PrintWriter w = new PrintWriter(filename)) {
            w.println("length, subwords");
            // each thread processes every MAX_THREADS-th word
            for (int i = counter; i < b.length; i += MAX_THREADS) {
                int[] base = b[i];
                int subwordcount = 0;
                // words are sorted by length, so stop scanning
                // once candidates get longer than word i
                for (int j = 0;
                     (j < b.length) && (wlen[j] <= wlen[i]); j++) {
                    int[] subword = b[j];
                    boolean issubword = true;
                    for (int k = 0; k < 26; k++) {
                        if (subword[k] > base[k]) {
                            issubword = false; break;
                        }
                    }
                    if (issubword) { subwordcount++; }
                }
                w.println(wlen[i] + ", " + subwordcount);
            }
        } catch (FileNotFoundException ex) {
            System.out.println("Failed to open file " + filename);
        }        
    }
}

The code uses the idea proposed in the comments to my last post to hold a table counting the number of occurrences of each letter in each word (array b). In order to speed up the computations, the code takes advantage of the fact that the words in TWL2006.txt are sorted by length, and it is additionally parallelized (6 threads by default), so it produces 6 files named result_[i].txt, where i ranges from 0 to 5. If you do not have a Java compiler you can download Scrabble.jar here (but it is better to build the jar file from the source, as your browser might complain about downloading jar files from untrusted sources).
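
For readers who prefer to stay in R, here is a minimal sketch of the same letter-count test for a single pair of words (the function is.subword is my own illustration, not part of the code above):

# y is a subword of x if every letter occurs in y
# at most as many times as it occurs in x
is.subword <- function(y, x) {
    cy <- table(factor(strsplit(y, "")[[1]], letters))
    cx <- table(factor(strsplit(x, "")[[1]], letters))
    all(cy <= cx)
}

is.subword("bare", "scrabble") # TRUE
is.subword("bee", "scrabble")  # FALSE; "scrabble" has only one "e"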

And now the code that runs the analysis. It comes in three parts. First we plot the relationship between word length and the logarithm of the mean subword count in the TWL2006 data set:

library(dplyr)

# you can run the jar outside R and skip this call
# computations should take ~1 minute with 6 threads
# the check safeguards against regenerating the results by accident
if (length(list.files(pattern="result_\\d+\\.txt")) == 0) {
    system("java -jar Scrabble.jar")
}

results <- list.files(pattern="result_\\d+\\.txt")
rbind_all(lapply(results, read.csv)) %>%
    group_by(length) %>%
    summarise(subwords=log(mean(subwords))) %>%
    plot(xlab = "word length", ylab = "log(mean(subwords))")

In the process we can see the power of dplyr ;). The resulting plot shows that this time we get a different result: the number of subwords grows subexponentially with word length.
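
To make this claim more concrete, one can compare a linear and a quadratic fit on the log scale (a rough sketch; means is my own name for the aggregate computed as above):

means <- rbind_all(lapply(results, read.csv)) %>%
    group_by(length) %>%
    summarise(lsw = log(mean(subwords)))
# a significantly negative quadratic term indicates
# slower-than-exponential growth of the subword count
summary(lm(lsw ~ length + I(length^2), data = means))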

In order to examine the reason for this, let us plot the word length distributions of the NGSL and TWL data sets:

twl2006_l <- nchar(readLines("TWL2006.txt"))
ngsl101_l <- nchar(readLines("NGSL101.txt")[-1])
par(lend=2)
hist(ngsl101_l, col=rgb(1,0,0,0.5), 0:15, FALSE, ylim=c(0, 0.2),
     xlab="letter count", main="")
hist(twl2006_l, col=rgb(0,0,1,0.5), 0:15, FALSE, add=TRUE)
legend("topright", c("NGSL101", "TWL2006"), title = "Dictionary",
       col=c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)), lty=1, lwd=14)

The histograms show that basic (NGSL) words tend to be shorter than TWL words. We can then reason that it is easier for a short word to be a subword of some longer word.
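
A quick numeric confirmation using the length vectors computed above:

# compare the word length distributions numerically
mean(ngsl101_l)
mean(twl2006_l)
quantile(ngsl101_l)
quantile(twl2006_l)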

Having established the difference in word length distributions, another question came to my mind: does the distribution of letters in words of a given length differ significantly between NGSL and TWL? Here is the code that checks this:

letter.freq <- function(file.name, skip){
    sdf <- read.table(file.name, as.is = T,
                      col.names = "word", skip = skip) %>%
        mutate(len = nchar(word)) %>%
        # keep only words from 2 to 14 letters for comparability
        filter(len > 1 & len < 15) %>%
        group_by(len) %>%
        # generate distribution of letters by word length
        do(props=prop.table(table(factor(
           strsplit(paste(.$word, collapse=""),"")[[1]],letters))))
    letfreq <- do.call(cbind, sdf$props)
    colnames(letfreq) <- 2:14
    letfreq
}

library(lattice)

twl2006 <- letter.freq("TWL2006.txt", 0)
# skip = 1 because NGSL101.txt starts with a header line
ngsl101 <- letter.freq("NGSL101.txt", 1)
levelplot(twl2006-ngsl101, main = "TWL2006-NGSL101",
          xlab = "letter", ylab = "word length")

The plot shows the differences between the distributions. Interestingly, we can again notice significant differences. For example, the letter "t" seems to be on average more frequent in NGSL words, and the letter "s" in TWL words.
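
To pin down which letters differ most, we can rank the rows of the difference matrix by their average across word lengths (a small sketch reusing the twl2006 and ngsl101 matrices from above):

# positive values: letter on average more frequent in TWL2006;
# negative values: more frequent in NGSL101
d <- sort(rowMeans(twl2006 - ngsl101))
head(d)  # letters overrepresented in NGSL101
tail(d)  # letters overrepresented in TWL2006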

Unfortunately I am not an English language specialist, so I cannot explain the patterns presented in the plots I have prepared today. If you have any insights concerning the distribution of letters in English words, I welcome you to share your thoughts (and code :)) in the comments.

