Python: Comparing 2 sets of data, yield best match and match %

I've seen lots of Q&A on this topic, but none contain the type of output I'm looking for. Any words of wisdom on this would be very much appreciated!

  • I have 2 lists... both lists contain 1 column, consisting of Full Name|University (i.e., name and university, concatenated, and separated by a pipe)
  • There's not always an exact match, due to nicknames and university abbreviations. I want to compare each record in list 1 with each record in list 2, and find the closest match.
  • I then want to produce an output file with 3 columns: every item from list 1, the closest match from list 2, and the match %.

Does anyone have sample code they could share? Thanks!

There is 1 answer

David Whitlock On BEST ANSWER

To get you started, here is an answer which can provide matches on either the full name or the university - you could extend it to include fuzzy search using a library like fuzzywuzzy:

  1. For both lists, split each string into a [full name, university] list (if some of the strings don't contain the '|' character, you might need to wrap this in a try/except block or guard it with an if statement):

    new_list = [item.split('|') for item in old_list]

  2. Use the following list comprehension to match on either element (assuming the two split lists from step 1 are called list1 and list2):

    # Keep both sides of each pair so you can see which records matched
    matches = [(val, item) for val in list1 for item in list2 if val[0] == item[0] or val[1] == item[1]]
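If you want the closest match and a match % rather than exact matches only, here is a rough sketch of one way to do it. It uses `difflib.SequenceMatcher` from the standard library instead of fuzzywuzzy, so treat the scores as one possible similarity measure, not the only sensible one. The helper names (`similarity`, `best_matches`, `write_matches`), the sample records, and the `matches.csv` filename are all made up for illustration, and the code assumes list 2 is not empty.

```python
import csv
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score between two strings, ignoring case."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matches(list1, list2):
    """For each record in list1, find the most similar record in list2."""
    results = []
    for record in list1:
        # max() keeps the first candidate when scores tie;
        # it raises ValueError if list2 is empty
        best = max(list2, key=lambda candidate: similarity(record, candidate))
        results.append((record, best, round(similarity(record, best), 1)))
    return results


def write_matches(rows, path):
    """Write (list 1 item, closest match, match %) rows to a CSV file."""
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(['list1_item', 'closest_match', 'match_pct'])
        writer.writerows(rows)


# Made-up sample data, for illustration only
list1 = ['Bob Smith|MIT', 'Jane Doe|Stanford University']
list2 = ['Jane Doe|Stanford', 'Robert Smith|MIT', 'Alan Turing|Cambridge']

rows = best_matches(list1, list2)

if __name__ == '__main__':
    write_matches(rows, 'matches.csv')  # hypothetical output filename
```

This compares every record in list 1 against every record in list 2, so it slows down quickly on large lists. If names and universities should be weighted differently, you could split on the pipe first and score the two parts separately.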