Business student totally new to Python wants a script for fuzzy string matching


I am a business student who has just begun learning Python. My professor asked me to do fuzzy matching between two files: US patent information and company information downloaded from a stock exchange website. My task is to compare the company names that appear in the US patent documentation (column 1 of file 1) with the names found on the stock exchange website (column 1 of file 2). As I understand it, the steps are: (1) convert all the letters in file 1 and file 2 to lower case; (2) pick each name from file 2, match it against all the names in file 1, and return the 15 closest matches; (3) repeat step 2 for every name in file 2; (4) record a similarity level for every match. I guess I will use the SequenceMatcher() object. I have just learned how to import data from my CSV files (I have 2 files); see below:

import csv
with open('USPTO.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    for row in data:
        print "------------------"
        print row
        print "------------------"
        for cell in row:
            print cell

Sorry about my silly question, but I am too new to know how to replace the strings (“abcde”, “abcde”, as shown below) with my own data. I have no idea how to change the data I imported to lower case, and I don’t even know how to set the 15-closest-matches standard. My professor told me this was an easy task, but I really feel defeated. Thank you for reading! Hopefully someone can give me some instructions. I am not that stupid :)

>>> import difflib
>>> difflib.SequenceMatcher(None, 'abcde', 'abcde').ratio()
1.0

There is 1 answer

Eric Sauer On BEST ANSWER

To answer your questions one by one:

1) "I have no idea how to change the data I imported to lower cases."

In order to change the cell to lower case, you would use [string].lower()

The following code will print out each cell in lower case:

import csv
with open('USPTO.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    for row in data:
        print "------------------"
        print row
        print "------------------"
        for cell in row:
            print cell.lower()

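So to store each cell in lower case (assigning to the loop variable alone does not change the data), you would collect the converted cells into a new list:

import csv
rows = []  # will hold every row with its cells converted to lower case
with open('USPTO.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    for row in data:
        # Build a new list; rebinding the loop variable would discard the result.
        rows.append([cell.lower() for cell in row])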

2) "I don’t even know how to set the 15 closest matches standard."

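For this you should set up a dictionary: the key is the first string (a name from file 2), and the value is a list of pairs (string2, ratio), where ratio is the value of difflib.SequenceMatcher(None, string1, string2).ratio(). Sorting that list by ratio in descending order and keeping the first 15 entries gives the 15 closest matches.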
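
As a rough illustration of that dictionary, here is a minimal sketch. The function names closest_matches and match_all and the sample company names are made up for this example (they are not from your files), and heapq.nlargest is just one convenient way to keep the top n pairs; sorting the list and slicing it would work as well.

```python
import difflib
import heapq


def closest_matches(name, candidates, n=15):
    """Return the n candidates most similar to name as (candidate, ratio) pairs."""
    scored = [
        (candidate, difflib.SequenceMatcher(None, name, candidate).ratio())
        for candidate in candidates
    ]
    # Keep only the n highest-scoring pairs, best match first.
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


def match_all(exchange_names, patent_names, n=15):
    """Map each lower-cased exchange name to its n closest patent names."""
    patent_lower = [p.lower() for p in patent_names]
    return {
        name.lower(): closest_matches(name.lower(), patent_lower, n)
        for name in exchange_names
    }


# Made-up sample data standing in for column 1 of each CSV file.
patent_names = ["Apple Inc.", "Applied Materials", "Microsoft Corp"]
exchange_names = ["APPLE INC"]
matches = match_all(exchange_names, patent_names, n=2)
```

Here matches["apple inc"] is a list of two (name, ratio) pairs, best match first. With real data you would pass in the first column of each file and leave n at 15. Note that this compares every name in file 2 against every name in file 1, so it can be slow on large files.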

Please attempt to write some code and we will help you fix it.

Look at https://docs.python.org/2/tutorial/datastructures.html for how to construct a dictionary