What a wonderful ‘wordle’!

Best opening words for Wordle (as per me!)

R
Author

Me

Published

April 5, 2022

Who would have thought that a bunch of black and green boxes would bring out the logophile in us all? With friends and family groups sharing their progress, I find this to be an entertaining mind-puzzle to kickstart the day.

And I was not alone in my quest for 5-letter words. Wordle has tickled the fascination of many in the data science community. I found Arthur Holtz’s lucid breakdown of the Wordle dataset quite interesting. Of course, there are 3B1B’s incredibly detailed videos on applying Information Theory to this 6-by-5 grid (the original video as well as the follow-up errata).

Others have simulated the Wordle game (like here) or even solved it for you (like this blog). I’ve read at least one blog post that has an academic take on the matter.

Fortunately for the reader, none of the above will be attempted by me. My inspiration comes from Gerry Chng’s Frequency Analysis Approach: I’ve tried to find the most commonly occurring letters in the official word list at each position, using a simple ranking mechanism.

What is a wordle?

The game rules are fairly simple:

  1. You need to guess a 5-letter word. One new word is given every day
  2. You are given 6 guesses
  3. After every guess, each square is coded by a color
    • GREY: chosen letter is not in the word
    • YELLOW: chosen letter is in the word but in the wrong position
    • GREEN: chosen letter is in the word and in the correct position
  4. Repetition of letters is allowed

That’s it!
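The colouring rules above can be sketched in a few lines of R. This is only an illustration: `wordle_feedback()` is a hypothetical helper of my own naming, and it assumes that repeated letters are handled by letting each letter of the answer be "used up" at most once (greens first, then yellows).

```r
# Hypothetical helper (not part of the analysis below): colour one guess.
wordle_feedback <- function(guess, answer) {
  g <- strsplit(toupper(guess), "")[[1]]
  a <- strsplit(toupper(answer), "")[[1]]
  stopifnot(length(g) == 5, length(a) == 5)

  result <- rep("GREY", 5)

  # Pass 1: right letter in the right position is green
  green <- g == a
  result[green] <- "GREEN"

  # Pass 2: answer letters not matched in pass 1 can still earn a yellow
  remaining <- a[!green]
  for (i in which(!green)) {
    hit <- match(g[i], remaining)
    if (!is.na(hit)) {
      result[i] <- "YELLOW"
      remaining <- remaining[-hit]  # assumption: each answer letter counts once
    }
  }
  result
}

# A and I are in CIGAR but in different positions; S, N and T are absent
wordle_feedback("SAINT", "CIGAR")
```

Calling it with the same word for guess and answer returns five greens, which is the winning grid.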

In my opinion, one of the reasons for the game going viral is the way the results are shared. You’ve possibly seen something like this floating around:

Sample Wordle share

…And if your family too has been bitten hard by the Wordle bug, then you would be familiar with group messages like this!

Wordle share in WhatsApp

Frequency analysis

Arthur Holtz’s blog is a good place to start for extracting and loading the official Wordle list. After parsing and cleaning the data, here are all the words in a single rectangular dataframe, wordle_list.

Update 29th Jan ’23: NYT’s .js file is not retrieving any list for some reason. I’ve referred to Arjun Vikram’s repo on dagshub

Code
knitr::opts_chunk$set(warning = FALSE, message = FALSE) 
suppressMessages({ 
library(httr)
library(dplyr)
library(stringr)
library(ggplot2)
library(ggthemes)
library(scales)
library(tidyr)
library(tibble)
library(forcats)
library(knitr)
library(kableExtra)
theme_set(theme_light())
})

url <- "https://www.nytimes.com/games/wordle/main.18637ca1.js" #not working
url2 <- "https://dagshub.com/arjvik/wordle-wordlist/raw/e8d07d33a59a6b05f3b08bd827385604f89d89a0/answerlist.txt"
wordle_script_text <- GET(url2) %>% 
  content(as = "text", encoding = "UTF-8")
# word_list = substr(
#   wordle_script_text,
#   # cigar is the first word
#   str_locate(wordle_script_text, "cigar")[,"start"],
#   # shave is the last word
#   str_locate(wordle_script_text, "shave")[,"end"]) %>%
#   str_remove_all("\"") %>%
#   str_split(",") %>%
#   data.frame() %>%
#   select(word = 1) %>%
#   mutate(word = toupper(word))


wordle_list <- str_split(wordle_script_text, "\n")

wordle_list <- data.frame(wordle_list) 

wordle_list <- rename(wordle_list, word = names(wordle_list)[1] ) %>% mutate(word = toupper(word)) #renaming column to 'word'

dim(wordle_list)
[1] 2310    1
Code
head(wordle_list)
   word
1 CIGAR
2 REBUT
3 SISSY
4 HUMPH
5 AWAKE
6 BLUSH

A modification of the above is another dataframe with each of the characters separated into its own column, which we’ll call position_word_list.

The line select(-x) removes the empty column that is created by the separate() function.

Code
position_word_list <- wordle_list %>% 
  separate(word, 
           sep = "", 
           into = c("x","p1","p2","p3","p4","p5")) %>% 
  select(-x)
head(position_word_list,10)
   p1 p2 p3 p4 p5
1   C  I  G  A  R
2   R  E  B  U  T
3   S  I  S  S  Y
4   H  U  M  P  H
5   A  W  A  K  E
6   B  L  U  S  H
7   F  O  C  A  L
8   E  V  A  D  E
9   N  A  V  A  L
10  S  E  R  V  E
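For comparison, here is a minimal base R sketch of the same split that never creates the empty leading column. It runs on a three-word toy vector (taken from the head of the list above) and the names `words` and `toy_positions` are mine, used only for illustration.

```r
# Toy input: the first three words of the list shown above
words <- c("CIGAR", "REBUT", "SISSY")

# strsplit() returns one character vector per word;
# rbind stacks those vectors as the rows of a 3 x 5 matrix
letters_matrix <- do.call(rbind, strsplit(words, ""))

toy_positions <- as.data.frame(letters_matrix, stringsAsFactors = FALSE)
names(toy_positions) <- paste0("p", 1:5)
toy_positions
```

This assumes every word has exactly five characters; with ragged input, rbind would silently recycle letters, so the separate() approach above is the safer choice on uncleaned data.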

Now onto some frequency analysis. Here’s a breakdown of all the letters in the Wordle list, sorted by number of occurrences and stored in letter_list, followed by a simple bar graph.

Code
# split every word into its letters and stack them into a single column
letter_list <- tibble(w_letter = unlist(str_split(wordle_list$word, ""))) %>%
  # keep only A-Z (drops anything left over from parsing, e.g. blanks)
  filter(w_letter %in% LETTERS) %>% 
  mutate(type = case_when(w_letter %in% c("A","E","I","O","U") ~ "vowel",
                          TRUE ~ "consonant")) %>% 
  group_by(w_letter, type) %>% 
  summarise(freq = n()) %>% 
  arrange(desc(freq))

letter_list %>% ungroup() %>% 
  ggplot(aes(x = reorder(w_letter, -freq), y = freq))+
  geom_col(aes(fill = type))+
  scale_y_continuous(labels = comma)+
  geom_text(aes(label = freq), 
            size = 3)+
  labs(x = "Letter", y = "Frequency",
       title = "Frequency of letters in Official Wordle list")

This is interesting. Now I’m curious to know the top letters in each position. To do this, I created a single table called freq_table that gives the frequency of occurrence of each letter by position. To repeat the process across all 5 positions, I used a for loop. The output is generated via the kableExtra package, which provides a neat scrollable window.

Code
#declaring null table
freq_table <- tibble(alpha = LETTERS)

for(i in 1:5){
    test <- position_word_list %>% 
    select(all_of(i)) %>%
# group_by_at() used for column index ID
    group_by_at(1) %>% 
    summarise(f = n()) %>% 
    arrange(desc(f)) %>% 
#first column returns p1, p2.. etc and is standardised
    rename(a = 1) 

#adding the freq values to a new dataframe
    freq_table <- freq_table %>%
    left_join(test, by = c("alpha" = "a")) 

#renaming column name to reflect the position number
    colnames(freq_table)[1+i] = paste0("p",i)
    rm(test)
}
#replacing NA with zero
freq_table[is.na(freq_table)] <- 0 
#output using kable's scrollable window 
kable(freq_table, 
      format = "html", 
      caption = "Frequency Table") %>%
    kable_styling() %>%
    scroll_box(width = "70%", height = "300px") %>% 
  kable_classic()
Frequency Table
alpha p1 p2 p3 p4 p5
A 140 304 306 162 63
B 173 16 56 24 11
C 198 40 56 150 31
D 111 20 75 69 118
E 72 241 177 318 422
F 135 8 25 35 26
G 115 11 67 76 41
H 69 144 9 28 137
I 34 201 266 158 11
J 20 2 3 2 0
K 20 10 12 55 113
L 87 200 112 162 155
M 107 38 61 68 42
N 37 87 137 182 130
O 41 279 243 132 58
P 141 61 57 50 56
Q 23 5 1 0 0
R 105 267 163 150 212
S 365 16 80 171 36
T 149 77 111 139 253
U 33 185 165 82 1
V 43 15 49 45 0
W 82 44 26 25 17
X 0 14 12 3 8
Y 6 22 29 3 364
Z 3 2 11 20 4
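The same kind of position-by-letter count can also be sketched without a loop or joins. The snippet below is a base R illustration on the first six words of the list only, so its numbers are toy numbers and not the ones in the table above; `words`, `pos` and `toy_freq` are names I made up for the example.

```r
# Toy input: first six words of the list
words <- c("CIGAR", "REBUT", "SISSY", "HUMPH", "AWAKE", "BLUSH")
pos <- do.call(rbind, strsplit(words, ""))  # 6 x 5 character matrix

# For each position (column), count every letter A-Z.
# factor(levels = LETTERS) keeps letters that never occur, with a count of 0.
counts <- apply(pos, 2, function(col) {
  as.integer(table(factor(col, levels = LETTERS)))
})

toy_freq <- data.frame(alpha = LETTERS, counts)
names(toy_freq) <- c("alpha", paste0("p", 1:5))

# Show only the letters that occur at least once
toy_freq[rowSums(toy_freq[-1]) > 0, ]
```

Because the zero counts come for free from the factor levels, there is no NA-to-zero replacement step in this variant.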

This table looks good. However, for my visualisation, I want to plot the top 10 letters in each position. For this, I’m going to use pivot_longer() to make it easier to generate the viz.

Code
freq_table_long10 <- freq_table %>% 
  pivot_longer(cols = !alpha, names_to = "position", values_to = "freq") %>% 
  select(position, alpha, freq) %>% 
  arrange(position, -freq) %>% 
  group_by(position) %>% 
  slice_head(n = 10) %>% ungroup

kable(freq_table_long10, 
      format = "html", 
      caption = "Top 10 letters within each position") %>%
    kable_styling() %>%
    scroll_box(height = "200px") %>% 
  kable_classic()
Top 10 letters within each position
position alpha freq
p1 S 365
p1 C 198
p1 B 173
p1 T 149
p1 P 141
p1 A 140
p1 F 135
p1 G 115
p1 D 111
p1 M 107
p2 A 304
p2 O 279
p2 R 267
p2 E 241
p2 I 201
p2 L 200
p2 U 185
p2 H 144
p2 N 87
p2 T 77
p3 A 306
p3 I 266
p3 O 243
p3 E 177
p3 U 165
p3 R 163
p3 N 137
p3 L 112
p3 T 111
p3 S 80
p4 E 318
p4 N 182
p4 S 171
p4 A 162
p4 L 162
p4 I 158
p4 C 150
p4 R 150
p4 T 139
p4 O 132
p5 E 422
p5 Y 364
p5 T 253
p5 R 212
p5 L 155
p5 H 137
p5 N 130
p5 D 118
p5 K 113
p5 A 63

So we have the number of occurrences in each position laid out in a tidy format in one long rectangular dataframe. Now to sprinkle on some magic, courtesy of ggplot.

Side note on reordering within facets

I tried my best to understand why I was unable to sort within each facet in spite of using free_y. Apparently that’s a known issue, and a workaround has been discussed by David Robinson, Julia Silge and Tyler Rinker. To achieve this, two more functions need to be created: reorder_within and scale_y_reordered.

Code
reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
  new_x <- paste(x, within, sep = sep)
  stats::reorder(new_x, by, FUN = fun)
}

scale_y_reordered <- function(..., sep = "___") {
  reg <- paste0(sep, ".+$")
  ggplot2::scale_y_discrete(labels = function(x) gsub(reg, "", x), ...)
}

freq_table_long10 %>% 
  mutate(type = case_when(alpha %in% c("A","E","I","O","U") ~ "vowel",
                          T ~ "consonant")) %>% 
  ggplot(aes(y = reorder_within(alpha, freq, position), x = freq))+
  geom_col(aes(fill = type))+
  scale_y_reordered()+
  facet_wrap(~position, 
             scales = "free_y", 
             ncol = 5)+
  labs(x = "Frequency", y = "Letter",
       title = "Frequency of top 10 letters by position in Official Wordle list ",
       caption = "D.S.Ramakant Raju\nwww.linkedin.com/in/dsramakant/")
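To see why the workaround is needed, here is a small base R sketch of what the two helper functions do under the hood. The four rows are taken from the frequency table above; the object names (`toy`, `labelled`, `ordered`) are mine and purely illustrative.

```r
# A appears in both facets, but with a different rank in each
toy <- data.frame(
  alpha    = c("S", "A", "A", "R"),
  position = c("p1", "p1", "p2", "p2"),
  freq     = c(365, 140, 304, 267)
)

# Step 1 (reorder_within): glue the facet name onto each letter,
# so that "A in p1" and "A in p2" become two distinct factor levels
labelled <- paste(toy$alpha, toy$position, sep = "___")

# Step 2 (reorder_within): order those distinct levels by frequency
ordered <- stats::reorder(labelled, toy$freq, FUN = mean)
levels(ordered)

# Step 3 (scale_y_reordered): strip the suffix again for the axis labels
gsub("___.+$", "", levels(ordered))
```

A single factor can hold only one ordering of "A", which is why plain reorder() cannot sort correctly inside every facet at once; the temporary suffix gets around that.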

Aha! Things are starting to get clearer. The most common letters in the 1st position are S, C, B, T and P; notice that only 1 vowel (A) makes the top 10. Vowels appear more frequently in the 2nd and 3rd positions. The last position has a higher occurrence of E, Y, T, R and L.

Which words can be the best Wordle openers?

Armed with the above knowledge, we can now filter for words built from the most commonly occurring letters. I also use a naive method to rank these words based on the occurrence of their letters. For instance, in the picture above, S A I N T is a valid word made up entirely of top-occurring letters.

Admittedly, I use a pretty crude method to determine the best openers. Known drawbacks of this methodology are:

  1. Doesn’t consider the future path of the word (number of steps to get to the right word)
  2. Only considers the rank of the letters and not the actual probability of occurrence

With that out of the way, I was able to determine that there are 39 words that can be formed with the top 5 occurring letters in each position. I’ve created a score that is determined by the rank of each letter within its position. For instance, S A I N T gets a score of 9 by summing up 1 (S in first position) + 1 (A in second position) + 2 (I in third) + 2 (N in fourth) + 3 (T in fifth). The lower the score, the higher the frequency of occurrences. Scroll below to read the rest of the words.

Code
#function to pick the top 5 letters
top5_selection <- function(x)
{x %>% arrange(desc(x[2])) %>% head(5) %>% select(1)}
#defining null table
final_grid <- tibble(ranking = 1:5)

for(i in 2:length(freq_table)){
  t <- top5_selection(select(freq_table,1,all_of(i)))
  final_grid <- cbind(final_grid,t)
  colnames(final_grid)[i] = paste0("p",i-1)
}
topwords <- position_word_list %>% 
filter(p1 %in% final_grid$p1,
       p2 %in% final_grid$p2,
       p3 %in% final_grid$p3,
       p4 %in% final_grid$p4,
       p5 %in% final_grid$p5) 

#finding consolidated score of each word
topwords <- topwords %>%  # plain assignment: %<>% would need library(magrittr)
  rowwise() %>% 
  mutate(p1_rank = which(p1 == final_grid$p1),
         p2_rank = which(p2 == final_grid$p2),
         p3_rank = which(p3 == final_grid$p3),
         p4_rank = which(p4 == final_grid$p4),
         p5_rank = which(p5 == final_grid$p5))

topwords2 <- topwords %>% 
  transmute(word = paste0(p1,p2,p3,p4,p5),
         score = sum(p1_rank, p2_rank,p3_rank, p4_rank, p5_rank)) %>% 
  arrange(score)

kable(topwords2, 
      format = "html",
      caption = "Top 39 words") %>%
    kable_styling() %>%
    scroll_box(width = "50%", height = "400px") %>% 
  kable_classic()
Top 39 words
word score
SAINT 9
CRANE 9
COAST 11
BRINE 11
CEASE 11
CRONE 11
CAUSE 12
CRIER 12
BRINY 12
BOAST 12
TAINT 12
CRONY 12
TEASE 13
POISE 13
TOAST 13
PAINT 13
BOOST 14
POINT 14
COUNT 14
PRONE 14
BEAST 14
PRINT 15
PAUSE 15
TAUNT 15
PROSE 15
CREST 15
CRUST 16
BRIAR 16
BOULE 16
POESY 16
CRUEL 16
PRUNE 16
BRUNT 16
TRUER 17
TREAT 18
TRIAL 18
TRUST 18
TRULY 19
TROLL 20
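The scoring rule can be checked by hand with a small sketch. Here the top 5 letters per position are hard-coded from the frequency tables above (an assumption on my part: A and L tie at 162 in the 4th position, and I keep them in the order the table shows). `top5` and `rank_score()` are hypothetical names for this illustration only.

```r
# Top 5 letters in each position, most frequent first (copied from the tables)
top5 <- list(
  p1 = c("S", "C", "B", "T", "P"),
  p2 = c("A", "O", "R", "E", "I"),
  p3 = c("A", "I", "O", "E", "U"),
  p4 = c("E", "N", "S", "A", "L"),
  p5 = c("E", "Y", "T", "R", "L")
)

# Hypothetical helper: sum of each letter's rank within its own position
rank_score <- function(word, grid = top5) {
  chars <- strsplit(toupper(word), "")[[1]]
  ranks <- mapply(function(ch, top) match(ch, top), chars, grid)
  sum(ranks)  # NA if any letter falls outside its position's top 5
}

# SAINT: 1 + 1 + 2 + 2 + 3 = 9, and TROLL: 4 + 3 + 3 + 5 + 5 = 20
sapply(c("SAINT", "CRANE", "COAST", "TROLL"), rank_score)
```

A word such as HUMPH returns NA because H is not among the top 5 letters in the first position, which mirrors the filter step in the code above.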

There we have it. My take on the best opening words.

I’ve used words such as SAINT, CRANE and COAST, and they’ve been reasonably useful to me.

Which are your favourite opening words? Please do leave a comment to let me know!