I have two lists of synsets generated from wordnet.synsets():
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

# convert tag to the one used by wordnet
def convert_tag(tag):
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

# define a function to find synset reference
def doc_to_synsets(doc):
    token = nltk.word_tokenize(doc)
    tag = nltk.pos_tag(token)
    wordnet_tag = convert_tag(tag)
    syns = [wn.synsets(token, wordnet_tag) for token in nltk.word_tokenize(doc)]
    syns_list = [token[0] for token in syns if token]
    return syns_list

# convert two example text documents
doc1 = 'This is a test function.'
doc2 = 'Use this function to check if the code in doc_to_synsets is correct!'
s1 = doc_to_synsets(doc1)
s2 = doc_to_synsets(doc2)
I am trying to write a function that finds, for each synset in s1, the synset in s2 with the largest path similarity score. For s1, which contains 4 unique synsets, the function should therefore return 4 path similarity scores, which I will then convert into a pandas Series for ease of computation.
This is the code I have so far:
def similarity_score(s1, s2):
    list = []
    for word1 in s1:
        best = max(wn.path_similarity(word1, word2) for word2 in s2)
        list.append(best)
    return list
However, it only returns an empty list, with no values in it:
[]
Would anyone care to look at what's wrong with my for loop and perhaps enlighten me on this subject?
Thank you.
I removed the "Synset" class references since I don't have whatever that class is, and it doesn't matter for scoring purposes. The score function is abstracted out so you can define it however you like. I took a stab at a very simplistic rule: it compares each position, demarcated by the `.` separators, to see if they are equal. If they are, the score is incremented. For example, in s1, `be.v.01` compared to a made-up `be.f.02` would have a score of 1, because only the prefix matches. If instead we compared it to `be.v.02`, we would have a score of 2, etc.
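The rule described above can be sketched roughly as follows. This is only an illustration, not the original answer's code: it assumes the synsets are represented by their name strings (such as `be.v.01`), and the names `position_score` and `best_scores` are hypothetical helpers made up for this example.

```python
def position_score(name1, name2):
    """Count the dot-separated positions at which two synset names agree."""
    parts1 = name1.split('.')
    parts2 = name2.split('.')
    return sum(1 for a, b in zip(parts1, parts2) if a == b)


def best_scores(items1, items2, score_func=position_score):
    """For each item in items1, return its highest score against items2."""
    results = []
    for item1 in items1:
        scores = [score_func(item1, item2) for item2 in items2]
        # Drop None so that a scorer which returns None for some pairs
        # does not break max(); items with no usable score are skipped.
        scores = [s for s in scores if s is not None]
        if scores:
            results.append(max(scores))
    return results


print(position_score('be.v.01', 'be.f.02'))  # 1: only the prefix matches
print(position_score('be.v.01', 'be.v.02'))  # 2: prefix and part of speech match
```

Because the scorer is passed in as `score_func`, a function wrapping `wn.path_similarity` could be substituted when working with real Synset objects. `path_similarity` can return None when no connecting path is found, which is why the sketch filters out None before calling `max()`.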