NLTK similarity performance issue?


NLTK has a nice word-to-word similarity function that measures similarity by how close two terms are to their common hypernym. Although it cannot be applied when the two terms have different POS tags, it is still great.
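For reference, the measure used in the code below, wup_similarity (Wu-Palmer), scores two concepts as 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is their deepest common hypernym. Here is a minimal sketch of that formula on a hand-made toy taxonomy. The PARENT table and all helper names are invented for illustration; NLTK itself works on WordNet synsets, which can have several hypernym paths, so its numbers will differ.

```python
# Toy hypernym tree: child -> parent. Invented data, not WordNet.
PARENT = {
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "entity",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def lowest_common_hypernym(a, b):
    """First ancestor of a (walking upward) that is also an ancestor of b."""
    ancestors_b = set(path_to_root(b))
    for node in path_to_root(a):
        if node in ancestors_b:
            return node
    return None

def toy_wup(a, b):
    """Wu-Palmer score on the toy tree: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_hypernym(a, b)
    if lcs is None:
        return 0.0
    return 2.0 * depth(lcs) / (depth(a) + depth(b))
```

In this toy tree, "dog" and "cat" share the hypernym "animal" and score higher than "dog" and "car", which only meet at the root.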

However, I found that it is very slow: about 10x slower than plain term matching. Is there any way to make the NLTK similarity function faster?

I tested with the code below:

from nltk import stem, RegexpStemmer
from nltk.corpus import wordnet, stopwords
from nltk.tag import pos_tag
import time

file1 = open('./tester.csv', 'r')

def similarityCal(word1, word2):
  synset1 = wordnet.synsets(word1)
  synset2 = wordnet.synsets(word2)
  if len(synset1) != 0 and len(synset2) != 0:
    wordFromList1 = synset1[0]
    wordFromList2 = synset2[0]
    return wordFromList1.wup_similarity(wordFromList2)
  else:
    return 0


start_time = time.time()
file1lines = file1.readlines()

stopwords = stopwords.words('english')
previousLine = ""
currentLine = ""
cntOri = 0
cntExp = 0

for line1 in file1lines:  
  currentLine = line1.lower().strip()
  if previousLine == "":
    previousLine = currentLine
    continue
  else:
    for tag1 in pos_tag(currentLine.split(" ")):
      tmpStr1 = tag1[0]
      if tmpStr1 not in stopwords and len(tmpStr1) > 1:
        if tmpStr1 in previousLine:
          print("termMatching word", tmpStr1)
          cntOri = cntOri + 1
      for tag2 in pos_tag(previousLine.split(" ")):
        tmpStr2 = tag2[0]
        # Compare only noun-noun or verb-verb pairs.
        if (tag1[1].startswith("NN") and tag2[1].startswith("NN")) or (tag1[1].startswith("VB") and tag2[1].startswith("VB")):
          value = similarityCal(tmpStr1, tmpStr2)
          if type(value) is float and value > 0.8:
            print(tmpStr1, " similar to ", tmpStr2, " ", value)
            cntExp = cntExp + 1
    previousLine = currentLine

end_time = time.time()
print("time taken : ", end_time - start_time, " // ", cntOri, " | ", cntExp)

file1.close()

To compare the performance, I simply commented out the similarity function call.
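One direction that might help, given that the nested loops look up the same word pairs again and again, is to memoize the similarity call. The sketch below is untested against the real data and uses a hypothetical stand-in, slow_similarity, in place of similarityCal so that it runs without WordNet; the call counter exists only to show the effect of the cache. Separately, the code above calls pos_tag on previousLine once per token of the current line, so tagging it once before the inner loop may also reduce the work.

```python
from functools import lru_cache

def slow_similarity(word1, word2):
    """Hypothetical stand-in for an expensive lookup such as similarityCal."""
    slow_similarity.calls += 1
    # Placeholder score (character overlap); NOT a WordNet measure.
    chars1, chars2 = set(word1), set(word2)
    union = chars1 | chars2
    return len(chars1 & chars2) / len(union) if union else 0.0

slow_similarity.calls = 0  # counts how often the expensive function really runs

@lru_cache(maxsize=None)
def cached_similarity(word1, word2):
    """Same result as slow_similarity, but each distinct pair is computed once."""
    return slow_similarity(word1, word2)
```

With this wrapper, a repeated call such as cached_similarity("street", "road") runs the underlying function only the first time. How much this saves depends on how often pairs repeat in the input.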

I used sample data from this site: https://www.briandunning.com/sample-data/

Any ideas?
