Hill-climbing algorithm to generate a string using Levenshtein distance as heuristic in Python?

2.7k views Asked by At

I have been following this ebook and I am stuck at one of their Self Check questions, which goes on like this:

Self Check

Here’s a self check that really covers everything so far. You may have heard of the infinite monkey theorem? The theorem states that a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type a given text, such as the complete works of William Shakespeare. Well, suppose we replace a monkey with a Python function. How long do you think it would take for a Python function to generate just one sentence of Shakespeare? The sentence we’ll shoot for is: “methinks it is like a weasel”

You’re not going to want to run this one in the browser, so fire up your favorite Python IDE. The way we’ll simulate this is to write a function that generates a string that is 27 characters long by choosing random letters from the 26 letters in the alphabet plus the space. We’ll write another function that will score each generated string by comparing the randomly generated string to the goal.

A third function will repeatedly call generate and score, then if 100% of the letters are correct we are done. If the letters are not correct then we will generate a whole new string.To make it easier to follow your program’s progress this third function should print out the best string generated so far and its score every 1000 tries.

Self Check Challenge

See if you can improve upon the program in the self check by keeping letters that are correct and only modifying one character in the best string so far. This is a type of algorithm in the class of ‘hill climbing’ algorithms, that is we only keep the result if it is better than the previous one.

I wrote some code that does the first part of this challenge using Levenshtein distance between generated and needed strings.

import random, string, nltk

def string_generator(length, collection):
    """
    takes all characters in collection and generates a string of size length.
    """
    return ''.join([random.choice(collection) for _ in xrange(length)])

def string_tester(output, text):
    """
    compares strings given and returns the Levenshtein distance.
    """
    return nltk.metrics.edit_distance(output, text)

if __name__ == '__main__':
    collection = [x for x in (string.ascii_lowercase + ' ')]
    longest_distance = 27
    best_string = None
    ctr = 0
    while True:
        random_string = string_generator(26, collection)
        distance = string_tester(random_string, "methinks it is like a weasel")
            ctr += 1
        ctr %= 1000
            if distance < longest_distance:
            best_string = random_string
            longest_distance = distance
            # end if the string generated is same as the one given
        if longest_distance == 0:
            print best_string
            print longest_distance
            break
            # use the best string to generate a better string every 1000th time
        if ctr == 0:
            print longest_distance
            print best_string
            # TODO: optimization here

I have no idea how can I generate a better string - using the best string until that iteration and given methods - at the TODO.

tl;dr: How can I write a hill climbing algorithm that uses Levenshtein distance as its heuristic until it generates a certain string? Please outline the process.

3

There are 3 answers

4
Joran Beasley On BEST ANSWER
target_word = "hello world"
target_size = len(target_word)
alphabet = "abcdefghijklmnopqrstuvwxyz "
def make_guess(alphabet,size):
   return "".join(random.choice(alphabet) for _ in range(size))

guess = make_guess(alphabet,target_size)

for i in itertools.count(0):
   if guess == target_word:
      break;
   if not i % 1000:
      print "Best So Far:",guess
   #climb hill and replace our guess if our next guess is better
   guess = min(guess,make_guess(alphabet,target_size),key=lambda _guess:levenstein(_guess,target_word))
print "Final Guess:",guess

this is called hill climbing because the potential solution only gets replaced if the next potential solution is better. this can lead to problems in other types of problem statements, where you will find local maxima or minima, that performs relatively well, but you will miss the global maxima or minima

0
Raydel Miranda On

See: Python collective intelligence programming.

This book brings examples and complete programs about IA algorithms, optimization and heuristics (all are IA algorithms to me), including the hill climbing algorithm.

0
Atul Singh On

Here is the complete solution for hill climbing which takes less than 100 iterations.

import string
import random

def randomGen(goalList):
    characters = string.ascii_lowercase+" "
    randString =""
    for i in range(len(goalList)):
        randString = randString+characters[random.randrange(len(characters))]
    randList = [randString[i] for i in range(len(randString))]
    return randList

def scoreRand(goalList,randList):
    numScore = 0
    for i in range(len(goalList)):
        if goalList[i] == randList[i]:
            numScore = numScore+1
    return numScore / len(goalList)

def common_elements(clist,list1, list2):
    for i in range(len(list1)):
        if list1[i] == list2[i]:
            clist[i] = list1[i]
    return clist

def main():
    goal = "methinks it is like a weasel"
    goalList = [goal[i] for i in range(len(goal))]
    clist = [' ' for i in range(len(goal))]
    randList = randomGen(goalList)
    clist = common_elements(clist,goalList, randList)
    score = scoreRand(goalList,clist)
    totalIteration = 0
    while(score < 1):
        newrandList = randomGen(goalList)
        newclist = common_elements(clist,goalList, randList)
        newscore = scoreRand(goalList,clist)
        score = newscore
        randList = newrandList
        clist = newclist
        totalIteration = totalIteration+1
        print(score," : ",''.join(clist))
    print("Total iterations: ",totalIteration)

main()