Python no output comparing string with imported large dictionary file

173 views Asked by At

I'm trying to write code to help me at crossword puzzle. I'm experiencing the following errors.

1.When I try to use the much larger text file with my word list I receive no output only the small 3 string word list works.

2.The match test positive for the first two strings of my test word list. I need it to only test true for the entire words in my word list. [ SOLVED SOLUTION in the code bellow ]

lex.txt contains

dad

add

test

I call the code using the following.
./cross.py dad

[ SOLVED SOLUTION ] This is really slow.

#!/usr/bin/env python

import itertools, sys, re

sys.dont_write_bytecode = True
original_string=str(sys.argv[1])
lenth_of_string=len(original_string)
string_to_tuple=tuple(original_string)


with open('wordsEn.txt', 'r') as inF:
    for line in inF:
        for a in set (itertools.permutations(string_to_tuple, lenth_of_string)):
            joined_characters="".join(a)
            if re.search('\\b'+joined_characters+'\\b',line):
                print joined_characters
1

There are 1 answers

2
Danstahr On

Let's take a look at your code. You take the input string, you create all possible permutations of it and then you look for these permutations in the dictionary.

The most significant speed impact from my point of view is that you create the permutations of the word over and over again, for every word in your dictionary. This is very time consuming.

Besides of that, you don't even need the permutations. It's obvious that two words can be "converted" to each other by permuting if they've got the same letters. So your piece of code can be reimplemented as follows :

import itertools, sys, re
import time
from collections import Counter


sys.dont_write_bytecode = True
original_string=str(sys.argv[1]).strip()
lenth_of_string=len(original_string)
string_to_tuple=tuple(original_string)

def original_impl():
    to_return = []
    with open('wordsEn.txt', 'r') as inF:
        for line in inF:
            for a in set (itertools.permutations(string_to_tuple, lenth_of_string)):
                joined_characters="".join(a)
                if re.search('\\b'+joined_characters+'\\b',line):
                    to_return.append(joined_characters)
    return to_return

def new_impl():
    to_return = []
    stable_counter = Counter(original_string)
    with open('wordsEn.txt', 'r') as inF:
        for line in inF:
            l = line.strip()
            c = Counter(l)
            if c == stable_counter:
                to_return.append(l)
    return to_return

t1 = time.time()
result1 = original_impl()
t2 = time.time()
result2 = new_impl()
t3 = time.time()

assert result1 == result2

print "Original impl took ", t2 - t1, ", new impl took ", t3 - t2, "i.e. new impl is ", (t2-t1) / (t3 - t2), " faster"

For a dictionary with 100 words of 8 letters, the output is :

Original impl took  42.1336319447 , new impl took  0.000784158706665 i.e. new impl is  53731.0006081  faster

The time consumed by the original implementation for 10000 records in the dictionary is unbearable.