I am a business student who has just started learning Python. My professor asked me to do fuzzy matching between two files: US patent information and company information downloaded from a stock exchange website. My task is to compare the company names that appear in the US patent documentation (column 1 of file 1) with the names found on the stock exchange website (column 1 of file 2). As far as I understand, the steps are: (1) convert all the names in file 1 and file 2 to lower case; (2) take each name from file 2, compare it with all the names in file 1, and return the 15 closest matches; (3) repeat step 2 for every name in file 2; (4) record the similarity score that comes with every match. I think I will use the SequenceMatcher() object. So far I have only learned how to import data from a CSV file (I have two files), see below:
import csv
with open('USPTO.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    for row in data:
        print "------------------"
        print row
        print "------------------"
        for cell in row:
            print cell
Sorry for the silly question, but I am too new to Python to know how to replace the strings ("abcde", "abcde", as shown below) with my own data. I have no idea how to convert the data I imported to lower case, and I don't know how to get the 15 closest matches. My professor told me this was an easy task, but I really feel defeated. Thank you for reading! Hopefully someone can give me some pointers. I am not that stupid :)
>>> import difflib
>>> difflib.SequenceMatcher(None, 'abcde', 'abcde').ratio()
1.0
To answer your questions one by one:
1) "I have no idea how to change the data I imported to lower cases."
To change a cell to lower case, call [string].lower() on it. It returns a new lower-cased string and leaves the original unchanged.
So inside your loop, call cell.lower() on each cell, either to print it in lower case or to store the lower-cased value for the matching step.
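A minimal sketch of that step, assuming the company name sits in the first column of each row. The function name lowercase_first_column and the sample rows are made up for illustration, and the extra strip() call is my own addition to drop stray spaces:

```python
import csv

def lowercase_first_column(lines):
    """Return the first cell of every non-empty CSV row, lower-cased."""
    names = []
    for row in csv.reader(lines, delimiter=','):
        if row:  # blank lines come back as empty rows, so skip them
            # strip() is an assumption: it removes leading/trailing spaces
            names.append(row[0].strip().lower())
    return names

# Made-up sample rows; an open file object can be passed in the same way.
sample = ['IBM Corp,1911', 'Apple Inc.,1976']
print(lowercase_first_column(sample))
```

csv.reader accepts any iterable of lines, so the same function should work on the csvfile object from your own code.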
2) "I don’t even know how to set the 15 closest matches standard."
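For this, set up a dictionary: the key is the first string (a name from file 2), and the value is a list of pairs (string2, score), where score is the value from difflib.SequenceMatcher(None, string1, string2).ratio(). Sort each list by score, highest first, and keep only the first 15 pairs.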
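One way the dictionary approach could look, as a rough sketch rather than a finished solution. The function name closest_matches, its parameter names, and the sample company names are all hypothetical, and the names are assumed to be lower-cased already:

```python
import difflib

def closest_matches(names_to_match, candidates, n=15):
    """Map each name to its n most similar candidates as (candidate, ratio) pairs."""
    results = {}
    for name in names_to_match:
        scored = []
        for candidate in candidates:
            ratio = difflib.SequenceMatcher(None, name, candidate).ratio()
            scored.append((candidate, ratio))
        # Highest similarity first, then keep only the top n pairs.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results[name] = scored[:n]
    return results

# Made-up names for illustration only.
exchange_names = ['apple inc']
patent_names = ['apple inc', 'apple incorporated', 'banana corp']
matches = closest_matches(exchange_names, patent_names, n=2)
for candidate, ratio in matches['apple inc']:
    print('%s %.2f' % (candidate, ratio))
```

This compares every name in one file with every name in the other, so it may be slow on large files. The standard library also offers difflib.get_close_matches(word, possibilities, n, cutoff), which returns the closest names directly but not their scores, and drops anything below the cutoff (0.6 by default).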
Please attempt to write some code and we will help you fix it.
See https://docs.python.org/2/tutorial/datastructures.html for how to construct a dictionary.