I have a list of keywords (50-100 keywords or set phrases) for 9 categories, and I am running spaCy's PhraseMatcher to check a sample text against each category's keywords and see whether it matches any of them. The solution should then assign the text to the category it belongs to, based on the PhraseMatcher results.
Keywords list:
strength = ['leading', 'honesty', 'public speaking', 'right decision making', ...]
weakness = ['Uncomfortable with public speaking', 'Risk-averse', 'Insecure', 'Limited experience in a particular skill or software', ...]
and so on.
Problem:
- Texts are matched to a category whose keywords they do not contain, and that same category comes up for most of the texts. E.g. the 'strength' category is returned for text that should belong to the 'qualities' or 'weakness' category.
- PhraseMatcher gives me a list of possible matches, and most results contain more than one match.
Results I get:
For a text in the weakness category, I get ['strength'] or ['strength', 'weakness'] as matches. The same issue occurs with the other categories too. 'strength' comes up as the first match in almost 70% of the test results.
I am using spaCy 3.6.0 and en_core_web_lg with Python 3.10.
The claim_subcat(txt) method takes lemmatized text without punctuation as a parameter and returns the list of matched categories.
def claim_subcat(self, txt):
    try:
        claim = self.keywords_df
        item_name = []
        for i in range(len(claim)):
            term = claim['keys'][i]
            term = term.replace('[', '')
            term = term.replace(']', '')
            term = term.replace("'", '')
            term = term.split(", ")
            if self.extract_claim(term, txt) == True:  # passing keywords and text in extract_claim method
                item_name.append(claim.loc[i, 'modules'])
        return item_name
    except:
        print(f"Unexpected error occurred")
        sys.exit(1)
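The chained replace calls in claim_subcat remove every bracket and apostrophe, including any that are part of a keyword itself. As a minimal sketch, assuming the 'keys' column stores the string form of a Python list (the replace calls suggest this, but it is an assumption), the keywords could be parsed with ast.literal_eval instead. The helper name parse_keywords and the sample input are hypothetical.

```python
import ast


def parse_keywords(raw):
    """Parse a stringified list such as "['leading', 'honesty']" into a list of str.

    Hypothetical helper: assumes `raw` is the repr of a Python list of strings.
    Entries that are empty after stripping whitespace are dropped.
    """
    parsed = ast.literal_eval(raw)
    return [str(item).strip() for item in parsed if str(item).strip()]


# Apostrophes inside a keyword survive, unlike with replace("'", '').
print(parse_keywords("['leading', 'public speaking', \"can't say no\"]"))
# → ['leading', 'public speaking', "can't say no"]
```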
The extract_claim method takes the keywords and the text and returns True if there is at least one match, else False.
def extract_claim(self, k1, text):
    matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
    patterns = [nlp.make_doc(_text) for _text in k1]
    matcher.add("1", None, *patterns)
    doc = nlp(text)
    matches = matcher(doc)
    if len(matches) == 0:
        #return doc
        return False
    return True
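Since extract_claim only returns True or False, it hides which keyword produced a match. The following is a minimal diagnostic sketch, not the code from my project: it registers each category under its own label in a single PhraseMatcher and returns the matched spans per category. It assumes a blank English pipeline (tokenizer only) instead of en_core_web_lg, and the shortened keyword lists and the names build_matcher and matched_keywords are hypothetical.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank pipeline (tokenizer only) is enough for LOWER matching,
# so no model download is needed for this sketch.
nlp = spacy.blank("en")

# Hypothetical, shortened keyword lists based on the examples in the post.
keywords = {
    "strength": ["leading", "honesty", "public speaking", "right decision making"],
    "weakness": ["Uncomfortable with public speaking", "Risk-averse", "Insecure"],
}


def build_matcher(nlp, keywords):
    """Build one PhraseMatcher with a separate match label per category."""
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    for category, phrases in keywords.items():
        matcher.add(category, [nlp.make_doc(phrase) for phrase in phrases])
    return matcher


def matched_keywords(nlp, matcher, text):
    """Return {category: [matched span texts]} for one input text."""
    doc = nlp.make_doc(text)
    hits = {}
    for match_id, start, end in matcher(doc):
        category = nlp.vocab.strings[match_id]
        hits.setdefault(category, []).append(doc[start:end].text)
    return hits


matcher = build_matcher(nlp, keywords)
result = matched_keywords(
    nlp, matcher, "I am uncomfortable with public speaking and often insecure."
)
print(result)
```

With these sample lists, the 'strength' keyword 'public speaking' also matches inside the weakness phrase 'uncomfortable with public speaking', so both categories are reported for this text. Overlapping keywords of this kind are one possible source of multiple matches; whether they explain my real results depends on the full keyword lists.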
What I expect:
Text = "I am a very hard working person with good public speaking skills."
should give me item_name = ['strength'] on calling the claim_subcat method.
Text = "I procrastinate and often insecure."
should give me item_name = ['weakness'] on calling the claim_subcat method.