I have two lists of synsets generated from wordnet.synsets():
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd
# Convert a POS tag to the one used by WordNet
def convert_tag(tag):
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None
# Define a function to find the synset for each token
def doc_to_synsets(doc):
    token = nltk.word_tokenize(doc)
    tag = nltk.pos_tag(token)
    wordnet_tag = convert_tag(tag)
    syns = [wn.synsets(token, wordnet_tag) for token in nltk.word_tokenize(doc)]
    syns_list = [token[0] for token in syns if token]
    return syns_list
#convert two example text documents
doc1 = 'This is a test function.'
doc2 = 'Use this function to check if the code in doc_to_synsets is correct!'
s1 = doc_to_synsets(doc1)
s2 = doc_to_synsets(doc2)
I am trying to write a function that, for each synset in s1, finds the synset in s2 with the largest 'path similarity' score. Since s1 contains 4 unique synsets, the function should return 4 path similarity scores, which I will then convert into a pandas Series object for ease of computation.
This is the code I have so far:
def similarity_score(s1, s2):
    list = []
    for word1 in s1:
        best = max(wn.path_similarity(word1, word2) for word2 in s2)
    return list
However, it only returns an empty list without any values in it.
Would anyone care to look at what's wrong with my for loop and perhaps enlighten me on this subject?
Thank you.
I removed the "Synset" class references since I don't have whatever that class is, and it doesn't matter for scoring purposes. The score function is abstracted out so you can define it however you like. I took a stab at a very simplistic rule: it compares each position, demarcated by the "." separators, to see if they are equal. If they are, the score is incremented. For example, be.v.01 in s1 compared to a made-up be.f.02 would have a score of 1, because only the prefix matches. If instead we compared it to be.v.02, we would have a score of 2, etc.
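A minimal sketch of the rule described above, assuming the synsets are represented as plain strings such as 'be.v.01'. The names position_score, best_scores and the sample lists are hypothetical, chosen only for illustration. Note that the loop appends each best score to the result list, which is the step missing from the function in the question.

```python
def position_score(a, b, sep="."):
    """Count the separator-delimited positions at which a and b are equal."""
    return sum(x == y for x, y in zip(a.split(sep), b.split(sep)))


def similarity_score(s1, s2, score=position_score):
    """For each item in s1, return its best score against any item in s2."""
    best_scores = []
    for item1 in s1:
        # Score item1 against every candidate, dropping pairs that the
        # score function cannot compare (signalled here by None).
        candidates = [score(item1, item2) for item2 in s2]
        candidates = [c for c in candidates if c is not None]
        if candidates:
            # Store the best score; without this append the result stays empty.
            best_scores.append(max(candidates))
    return best_scores


# Hypothetical sample data: synset names written as plain strings.
names1 = ["be.v.01", "trial.n.02"]
names2 = ["be.f.02", "be.v.02", "function.n.01"]

print(position_score("be.v.01", "be.f.02"))  # 1
print(position_score("be.v.01", "be.v.02"))  # 2
print(similarity_score(names1, names2))
```

Because the score function is a parameter, a different rule can be swapped in. With real NLTK synsets, something like `lambda a, b: wn.path_similarity(a, b)` could be passed instead; the None filter is there because path_similarity can return None when no path connects two synsets, and max() cannot compare None with a number in Python 3. The resulting list can then be passed to pd.Series.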