The question is whether the exponential relationship between the number of letters in a word and the number of its subwords, observed in the NGSL data set, still holds for TWL2006.
The challenge is that NGSL has 2801 words, while TWL2006 is much larger, with 178691 words. You can download the file TWL2006.txt, which contains the words and was prepared (converted to lowercase and sorted by word length) using data from the Internet Scrabble Club website.
You could run the code from the comments to my last post to obtain the results, but it takes ages to compute (over 1 hour). Therefore I have written the data preparation procedures in Java, which reduced the time needed to perform the analysis to around 1 minute.
So first let us start with the Java code that computes the number of subwords for each word in the TWL2006 dictionary:
package scrabble;

import java.io.*;

public class Scrabble implements Runnable {
    // letter counts for each word: b[i][k] holds the number of
    // occurrences of letter ('a' + k) in word i
    public static int[][] b = new int[178691][26];
    public static int wlen[] = new int[178691];
    // Set number of threads to match your hardware
    public static final int MAX_THREADS = 6;

    public static void main(String[] args)
            throws FileNotFoundException, IOException {
        String file = "TWL2006.txt";
        int i;
        try (BufferedReader r =
                new BufferedReader(new FileReader(file))) {
            i = 0;
            for (String line; (line = r.readLine()) != null; i++) {
                for (char c : line.toCharArray()) { b[i][c - 'a']++; }
                wlen[i] = line.length();
            }
        }
        for (i = 0; i < MAX_THREADS; i++) {
            new Thread(new Scrabble(i)).start();
        }
    }

    private final int counter;

    public Scrabble(int counter) {
        this.counter = counter;
    }

    @Override
    public void run() {
        String filename = "result_" + counter + ".txt";
        try (PrintWriter w = new PrintWriter(filename)) {
            w.println("length, subwords");
            // each thread processes every MAX_THREADS-th word
            for (int i = counter; i < b.length; i += MAX_THREADS) {
                int[] base = b[i];
                int subwordcount = 0;
                // the input file is sorted by word length, so we can
                // stop scanning once candidate words get longer
                for (int j = 0;
                        (j < b.length) && (wlen[j] <= wlen[i]); j++) {
                    int[] subword = b[j];
                    boolean issubword = true;
                    for (int k = 0; k < 26; k++) {
                        if (subword[k] > base[k]) {
                            issubword = false; break;
                        }
                    }
                    if (issubword) { subwordcount++; }
                }
                w.println(wlen[i] + ", " + subwordcount);
            }
        } catch (FileNotFoundException ex) {
            System.out.println("Failed to open file " + filename);
        }
    }
}
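For reference, the subword test itself can also be expressed directly in R. Below is a minimal sketch (the function name is.subword is mine) of the same letter-count comparison that the inner Java loop performs, here for a single pair of words:
is.subword <- function(sub, base) {
  # count occurrences of each of the 26 letters in both words
  sub.counts  <- table(factor(strsplit(sub,  "")[[1]], levels = letters))
  base.counts <- table(factor(strsplit(base, "")[[1]], levels = letters))
  # sub is a subword if it never needs more copies of a letter
  all(sub.counts <= base.counts)
}
is.subword("cat", "tactful")  # TRUE
is.subword("cats", "tactful") # FALSE, "tactful" has no "s"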
And now the code that runs the analysis. It comes in three parts. First we plot the relationship between word length and the logarithm of the mean subword count in the TWL2006 data set:
library(dplyr)
# the Java program may also be run outside R,
# in which case the system() call below is skipped;
# computations should take ~1 minute with 6 threads
# the check additionally safeguards against accidentally
# regenerating the results
if (length(list.files(pattern = "result_\\d+\\.txt")) == 0) {
  system("java -jar Scrabble.jar")
}
results <- list.files(pattern = "result_\\d+\\.txt")
rbind_all(lapply(results, read.csv)) %>%
  group_by(length) %>%
  summarise(subwords = log(mean(subwords))) %>%
  plot(xlab = "word length", ylab = "log(mean(subwords))")
In the process we can see the power of dplyr ;). The result is the following:
This time we get a different result: the number of subwords grows subexponentially with word length.
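A straight line on this plot would indicate exponential growth. One way to make the subexponential claim more precise is to fit a linear model to the logarithm of the mean subword count and inspect the residuals; here is a sketch (the names msub and fit are mine) reusing the results files loaded above:
msub <- rbind_all(lapply(results, read.csv)) %>%
  group_by(length) %>%
  summarise(logsub = log(mean(subwords)))
fit <- lm(logsub ~ length, data = msub)
plot(msub$length, residuals(fit),
     xlab = "word length", ylab = "residual of linear fit")
# systematic curvature in the residuals indicates deviation
# from a purely exponential relationship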
To examine the reason for this difference, let us plot the word length distributions of the NGSL and TWL data sets:
twl2006_l <- nchar(readLines("TWL2006.txt"))
ngsl101_l <- nchar(readLines("NGSL101.txt")[-1])
par(lend = 2)
hist(ngsl101_l, col = rgb(1, 0, 0, 0.5), breaks = 0:15,
     freq = FALSE, ylim = c(0, 0.2),
     xlab = "letter count", main = "")
hist(twl2006_l, col = rgb(0, 0, 1, 0.5), breaks = 0:15,
     freq = FALSE, add = TRUE)
legend("topright", c("NGSL101", "TWL2006"), title = "Dictionary",
       col = c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5)),
       lty = 1, lwd = 14)
So we can see that basic (NGSL) words tend to be shorter than TWL words. We can then reason that it is easier for a short word to be a subword of some longer word.
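This impression can be backed with simple summary statistics; a quick sketch reusing the ngsl101_l and twl2006_l vectors computed above (the cut-off of 8 letters is arbitrary):
# compare the two length distributions numerically
summary(ngsl101_l)
summary(twl2006_l)
# proportion of words longer than 8 letters in each dictionary
mean(ngsl101_l > 8)
mean(twl2006_l > 8)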
Having established the difference in word length distributions, another question came to me: does the distribution of letters in words of a given length differ significantly between NGSL and TWL? Here is the code that checks this:
letter.freq <- function(file.name, skip) {
  sdf <- read.table(file.name, as.is = TRUE,
                    col.names = "word", skip = skip) %>%
    mutate(len = nchar(word)) %>%
    # restrict to words of 2 to 14 letters for comparability
    filter(len > 1 & len < 15) %>%
    group_by(len) %>%
    # distribution of letters for each word length
    do(props = prop.table(table(factor(
      strsplit(paste(.$word, collapse = ""), "")[[1]], letters))))
  letfreq <- do.call(cbind, sdf$props)
  colnames(letfreq) <- 2:14
  letfreq
}
library(lattice)
twl2006 <- letter.freq("TWL2006.txt", 0)
ngsl101 <- letter.freq("NGSL101.txt", 1) # skip the header line, as above
levelplot(twl2006-ngsl101, main = "TWL2006-NGSL101",
xlab = "letter", ylab = "word length")
Interestingly, we can again notice significant differences. For example, the letter "t" seems to be on average more frequent in NGSL words, and the letter "s" in TWL words.
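These impressions can be checked numerically by extracting single letter rows from the difference matrix; a short sketch (the name freq.diff is mine):
freq.diff <- twl2006 - ngsl101  # positive: more frequent in TWL
round(freq.diff["t", ], 3)  # expected mostly negative values
round(freq.diff["s", ], 3)  # expected mostly positive values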
Unfortunately, I am not an English language specialist, so I will not try to explain the patterns presented in the plots I have prepared today. If you have any insights concerning the distribution of letters in English words, you are welcome to share your thoughts (and code :)) in the comments.