Organization of fuzzy matches

384 views Asked by At

I've gone through and fuzzy matched each element in a list of 20,000+ movie titles with each other element, which returns a value for each pair:

from fuzzywuzzy import fuzz

titles = ['Scary Movie', 'Happy Movie', 'Sappy Movie', 'Crappy Movie']

print fuzz.ratio(titles[2],titles[3])
>>> 91 #/100 higher value denotes closer match 

for x in titles:
    for y in titles:
        fuzzed = fuzz.ratio(x,y)

    print "value for %r and %r is %r" % (x, y, fuzzed)

How can I organize this data efficiently? More specifically- how can I get matches to group together based on their match value?

Capturing the return values from the nested loops and then packaging them with x and y into tuples or lists is obviously redundant and messy. I attempted an implementation using classes but I'm missing something.

3

There are 3 answers

2
zero323 On BEST ANSWER

Using list comprehensions and itertools.product:

from itertools import product
[(x, y, fuzz.ratio(x, y)) for (x, y) in product(titles, repeat=2)]

Nice and lazy solution using toolz

from toolz.curried import pipe, filter, map
pipe(
    product(titles, repeat=2),
    # No reason to check duplicates
    filter(lambda (x, y): x > y),
    map(lambda (x, y): (x, y, fuzz.ratio(x, y))))
0
Anand S Kumar On

Maybe you can store the fuzzed ratio in a dictionary with the (x,y) tuple as the key, would make it easier to search the ratio for each pair later on. For that you can create an empty dictionary outside the for loop , and then in for loop assign the fuzz.ratio(x , y) to the key (x, y) for that dictionary.

Example code -

fuzzDict = {}
for x in titles:
    for y in titles:
        fuzzDict[(x,y)] = fuzz.ratio(x,y)

Later when you want to retrieve the ratio, you can simple call fuzzDict[(x , y)] to get it.


You can also use dictionary comprehension for that in Python 2.7+ -

{(x, y) : fuzz.ratio(x,y) for x in titles for y in titles}
0
Navith On

You only need to iterate over combinations of the titles since the ratio doesn't depend on the order. This is significantly faster than iterating over a product of it.

For your list of 20,000 titles, you would iterate over 400 000 000 pairs if you used product. With combinations, you are only iterating over 199 990 000.

from fuzzywuzzy import fuzz

import collections
import itertools

titles = ['Scary Movie', 'Happy Movie', 'Sappy Movie', 'Crappy Movie']

Then you can store the ratios in a dictionary where you can look up a ratio to get a set of pairs with that ratio.

fuzzes_by_ratio = collections.defaultdict(set)

Or in a dictionary where you can look up a frozenset of pairs and get their ratio.

fuzzes_per_pair = {}

-

for m1, m2 in itertools.combinations(titles, 2):
    pair = frozenset({m1, m2})
    ratio = fuzz.ratio(m1, m2)

    fuzzes_by_ratio[ratio].add(pair)
    fuzzes_per_pair[pair] = ratio

Then you can retrieve the data later:

# I don't have fuzzywuzzy installed so this is just made up:
>>> fuzzes_by_ratio[91]
{frozenset({"Scary Movie", "Happy Movie"}), frozenset({"Sappy Movie", "Happy Movie"})}

>>> fuzz_per_pair[frozenset({"Scary Movie", "Sappy Movie"})]
82

Keep in mind that you will need tons of memory for this storage. You can halve it by using only one of the two methods above, depending on your needs / convenience.