Multiple and wrong results from spaCy PhraseMatcher

I have lists of keywords (50-100 keywords or phrases) for 9 categories, and I am running spaCy's PhraseMatcher to check a sample text against each category's keywords to see whether it matches. The solution should then assign the text to the category it belongs to, based on the PhraseMatcher results.

Keywords list:

strength = ['leading', 'honesty', 'public speaking', 'right decision making', ...]
weakness = ['Uncomfortable with public speaking', 'Risk-averse', 'Insecure', 'Limited experience in a particular skill or software', ...]
and so on.

Problem:

  1. Texts are matched to a category whose keywords they do not contain, and that same category comes up for most of the texts. E.g. the 'strength' category is returned for text that should belong to 'qualities' or 'weakness'.
  2. The PhraseMatcher gives me a list of possible matches, and most results contain more than one match.

Results I get:

For text in the weakness category, I get ['strength'] and ['strength', 'weakness'] as matches. The same issue occurs with the other categories. 'strength' comes up as the first match in almost 70% of the test results.

I am using spaCy 3.6.0 and en_core_web_lg with Python 3.10.

The claim_subcat(txt) method takes lemmatized text without punctuation as its parameter and returns the list of matched categories.

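def claim_subcat(self, txt):
        """Return the categories ('modules') whose keywords occur in txt."""
        try:
            claim = self.keywords_df
            item_name = []
            for i in range(len(claim)):
                # Each 'keys' cell holds a stringified list: strip the brackets
                # and quotes, then split it into individual keywords.
                term = claim['keys'][i]
                term = term.replace('[', '').replace(']', '').replace("'", '')
                term = term.split(", ")
                # Pass the keywords and the text to extract_claim.
                if self.extract_claim(term, txt):
                    item_name.append(claim.loc[i, 'modules'])
            return item_name
        except Exception as exc:
            # Report the actual error instead of hiding it, then exit.
            print(f"Unexpected error occurred: {exc}")
            sys.exit(1)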
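
The bracket and quote stripping above suggests that each 'keys' cell holds a stringified Python list such as "['leading', 'honesty']". If that assumption holds, ast.literal_eval may be a more robust way to recover the keywords, since it keeps apostrophes inside a keyword and does not depend on the exact ", " separator. This is only a sketch; parse_keywords is a hypothetical helper name, not part of the original code.

```python
import ast

def parse_keywords(cell):
    """Parse a stringified list like "['a', 'b c']" into clean keyword strings."""
    values = ast.literal_eval(cell)
    # Drop surrounding whitespace and skip empty or non-string entries.
    return [v.strip() for v in values if isinstance(v, str) and v.strip()]

parsed = parse_keywords("['leading', 'honesty',  'public speaking']")
print(parsed)
```

Printing the parsed list for a few rows is a quick way to confirm that each category really receives the keywords it is supposed to have.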

extract_claim takes the keywords and the text and returns True if there is at least one match, else False.

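def extract_claim(self, k1, text):
        # Match case-insensitively on each token's lowercase form.
        matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
        patterns = [nlp.make_doc(_text) for _text in k1]
        # spaCy 3 takes the patterns as a list in the second argument.
        matcher.add("1", patterns)
        doc = nlp(text)
        matches = matcher(doc)
        # True if at least one keyword matched, else False.
        return len(matches) > 0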
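
Because extract_claim only returns True or False, it does not show which keyword triggered a category. One way to inspect this is to register every category under its own label in a single PhraseMatcher and collect the matched spans. The sketch below is a debugging aid under stated assumptions: it uses a blank English pipeline (tokenizer only, which is enough for attr="LOWER"), a shortened, made-up keyword dict, and a hypothetical helper named match_categories.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Tokenizer-only pipeline; no trained model is needed for LOWER matching.
nlp = spacy.blank("en")

# Hypothetical, shortened keyword lists based on the ones in the question.
keywords = {
    "strength": ["leading", "honesty", "public speaking"],
    "weakness": ["risk-averse", "insecure"],
}

# Build the matcher once, with one label per category.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for category, phrases in keywords.items():
    matcher.add(category, [nlp.make_doc(p) for p in phrases])

def match_categories(text):
    """Return {category: [matched span texts]} for the given text."""
    doc = nlp.make_doc(text)
    found = {}
    for match_id, start, end in matcher(doc):
        label = nlp.vocab.strings[match_id]
        found.setdefault(label, []).append(doc[start:end].text)
    return found

result = match_categories(
    "I am a very hard working person with good public speaking skills."
)
print(result)
```

Seeing the matched spans per category makes overlaps visible: for example, a short keyword in one category (such as 'public speaking') will also match inside a longer phrase that belongs to another category (such as 'Uncomfortable with public speaking'). Note also that with attr="LOWER" the keywords are compared against the token text as given, so lemmatized input may no longer contain the keyword forms.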

What I expect:

  1. Text = "I am a very hard working person with good public speaking skills." should give me item_name = ['strength'] when calling the claim_subcat method.
  2. Text = "I procrastinate and often insecure." should give me item_name = ['weakness'] when calling the claim_subcat method.