I rencently went through an old Prolog project made during my studies to solve Wordle. The program uses a file containing 7980 words that should be sorted by the "value" it brings. With this solver, there was a Python code that I made to generate this file using information theory. Because I was unable to understand what I wrote and because it was slow and probably not that good, I have decided to recode it but I'm facing a problem: my code is very very slow. To compute the entropy of 1 word, the best I get is 0.5s, so it takes 0.5*7980s to run which is (if my math is good) more than an hour !
Here the code I use for testing :
import math
import time
def read_words(path):
words = []
with open(path, "r") as file:
while line := file.readline():
line = line.replace("word([", "").replace("]).", "").replace("\n", "")
words.append("".join([l.strip().replace("'", "") for l in line.split(",")]))
return words
def get_int_pattern(soluce, w):
res = [0, 0, 0, 0, 0]
used = [False, False, False, False, False]
wrange = range(len(w))
for i in wrange:
if w[i] == soluce[i]:
res[i] = 2
used[i] = True
for i in wrange:
if used[i]:
continue
for j in wrange:
if not used[j] and w[i] == soluce[j]:
res[i] = 1
used[j] = True
break
res_value = 0
for i, r in enumerate(res):
res_value += r * 3 ** i
return res_value
def generate_int_pattern_matrix(all_words):
wlen = len(all_words)
matrix = {}
start_time = time.time()
for i, w in enumerate(all_words):
matrix[w] = {}
for w2 in all_words:
matrix[w][w2] = get_int_pattern(w, w2)
print("\rGenerating int matrix {}/{} {}s".format(i+1, wlen, round(time.time() - start_time, 1)), end="")
print()
return matrix
def int_pattern_match(word, pattern, test):
return get_int_pattern(test, word) == pattern
def int_pattern_match2(matrix, word, pattern, test):
return matrix[test][word] == pattern
all_words = read_words("../save/word_origin.pl")
wlen = len(all_words)
all_patterns = range(243)
pattern_matrix = generate_int_pattern_matrix(all_words)
HTimeC = HTimeL = HTimeS = 0
start_time = time.time()
for w in all_words:
H = 0
HTimeS = time.time()
for p in all_patterns:
count = 0
for w2 in all_words:
if int_pattern_match2(pattern_matrix, w, p, w2):
count += 1
if count == 0:
continue
proba = count / wlen
H += proba * -math.log2(proba)
HTimeC += time.time() - HTimeS
HTimeL += 1
print("\r{}/{} H={} averageHTime={} totalSec={}s".format(HTimeL, wlen, H, round(HTimeC/HTimeL, 3), round(time.time() - start_time, 1)), end="")
print()
For the explanations, I'm computing the entropy for each word, which correspond to the average information that we should get if we play it. For this, for each pattern of color possible (combination of 5 green, orange or black), I count the number of words that can have made this pattern if they were the solution. The information I get for a pattern is : -log2(p) where p = count / total words. The entropy of a word is the sum of the information value of each pattern multiplied by their probability.
A pattern is coded in a base 3 int because it is faster than string for comparison. ex: 189 = 2*3^4 + 1*3^3 + 0*3^2 + 0*3^1 + 0*3^0 = green, orange, black, black, black.
The int_pattern_match2(matrix, word, pattern, test) function checks if test matches a pattern by comparing the given pattern and the pattern produced when playing word with test as the solution. If they match, it means that test could be the solution.
The generate_int_pattern_matrix(all_words) function pre-compute all the patterns in approximately 2min. I have tried to save this in a file and load it to avoid running it gain but it is slower and makes Codium crash apparently.
I posted the file containing all the words here: https://drive.google.com/file/d/1TxU9LsGWpFJPckaAWCa5fABBnUVXRTm9/view?usp=sharing.
I remember getting inspirations from youTube videos of ScienceEtonnante (french) and 3Blue1Brown. The first one was saying that his code, for the first turn, takes like 1 minute to run... how am I that slow ? it has been 2 days and I don't know what to do anymore. How can I optimize my code ?
Thanks a lot to those who may answer.