Find the percent of tokens shared by two documents with spaCy


For NLTK it would be something like:

def symm_similarity(textA,textB):
    textA = set(word_tokenize(textA))
    textB = set(word_tokenize(textB))    
    intersection = len(textA.intersection(textB))
    difference = len(textA.symmetric_difference(textB))
    return intersection/float(intersection+difference) 
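
For concreteness, here's roughly what that gives on a toy pair of strings (illustration added, not part of the original question):

print(symm_similarity("the cat sat on the mat",
                      "the cat sat on the hat"))
# intersection = 4 ({'the', 'cat', 'sat', 'on'}), symmetric difference = 2 ({'mat', 'hat'})
# -> 4 / (4 + 2) ≈ 0.67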

Since spaCy is faster, I'm trying to do it in spaCy, but the Token objects don't seem to offer a quick solution for this. Any ideas?

Thanks all.

1 Answer

Accepted answer, by syllogism_:

Your function gets the percentage of word types shared, not tokens. You're taking the set of words, without sensitivity to their counts.
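
To make the distinction concrete (example added, not part of the original answer): repeated words collapse in the set-based version.

tokens = word_tokenize("the cat sat on the mat")
print(len(tokens))       # 6 tokens
print(len(set(tokens)))  # 5 types, because "the" appears twice but is only counted once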

If you want counts of tokens, I expect the following to be very fast, so long as you have the vocabulary file loaded (which it will be by default, if you have the data installed):

from collections import Counter

from spacy.attrs import ORTH

def symm_similarity_types(nlp, textA,textB):
    docA = nlp.make_doc(textA)
    docB = nlp.make_doc(textB)
    countsA = Counter(docA.count_by(ORTH))
    countsB = Counter(docB.count_by(ORTH))
    # Counter subtraction keeps only positive counts, i.e. tokens in A beyond B's counts
    diff = sum(abs(val) for val in (countsA - countsB).values())
    return diff / (len(docA) + len(docB))
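
Called like this, for example (a sketch, assuming you have an English model installed; use whichever model name you actually downloaded):

import spacy

nlp = spacy.load("en_core_web_sm")
score = symm_similarity_types(nlp, "the cat sat on the mat",
                                   "the cat sat on the hat")
print(score)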

If you want to compute exactly the same thing as your code above, here's the spaCy equivalent. The Doc object lets you iterate over Token objects. You should then base your counts on the token.orth attribute, which is the integer ID of the string. I expect working with sets of integers will be a bit faster than working with sets of strings:

def symm_similarity_types(nlp, textA,textB):
    docA = set(w.orth for w in nlp(textA))
    docB = set(w.orth for w in nlp(textB))
    intersection = len(docA.intersection(docB))
    difference = len(docA.symmetric_difference(docB))
    return intersection / float(intersection + difference)

This should be a bit more efficient than the NLTK version, because you're working with sets of integers, not strings.
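
A small aside (added, using spaCy's standard attributes): token.orth is the integer ID, and token.orth_ gives the string back, which is handy if you want to inspect which words the two texts share.

doc = nlp("the cat sat on the mat")  # nlp as loaded above
print([(w.orth, w.orth_) for w in doc][:2])  # each integer ID paired with its string, here for "the" and "cat"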

If you're really concerned about efficiency, it's often more convenient to just work in Cython, instead of trying to guess what Python is doing. Here's the basic loop:

# cython: infer_types=True
for token in doc.c[:doc.length]:
    orth = token.lex.orth

doc.c is a TokenC*, so you're iterating over contiguous memory and dereferencing a single pointer (token.lex is a const LexemeC*).
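
A slightly fuller version of that loop, as a sketch only (it assumes spaCy's Cython headers are available so spacy.tokens.doc can be cimported, and that you build this as a Cython extension):

# cython: infer_types=True
from spacy.tokens.doc cimport Doc

def orth_ids(Doc doc):
    # Walk the underlying TokenC array directly; doc.c is a TokenC* and
    # token.lex is a const LexemeC* whose orth field is the integer ID.
    ids = set()
    for i in range(doc.length):
        ids.add(doc.c[i].lex.orth)
    return ids

You'd then intersect the two resulting sets in plain Python, exactly as in the function above.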