For NLTK it would be something like:
from nltk.tokenize import word_tokenize

def symm_similarity(textA, textB):
    # Compare the sets of distinct word types in the two texts.
    textA = set(word_tokenize(textA))
    textB = set(word_tokenize(textB))
    intersection = len(textA.intersection(textB))
    difference = len(textA.symmetric_difference(textB))
    return intersection / float(intersection + difference)
Since spaCy is faster, I'm trying to do it in spaCy, but the Token objects don't seem to offer a quick solution to this. Any ideas?
Thanks all.
Your function gets the percentage of word types shared, not tokens. You're taking the set of words, without sensitivity to their counts.
If you want counts of tokens, I expect the following to be very fast, so long as you have the vocabulary file loaded (which it will be by default, if you have the data installed):
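Something along these lines (a sketch using Doc.count_by and collections.Counter; the function and variable names are just illustrative, and it assumes an nlp object from spacy.load):

from collections import Counter
from spacy.attrs import ORTH

def symm_similarity_counts(nlp, textA, textB):
    # Doc.count_by(ORTH) maps the integer ID of each token string to its
    # frequency in the document.
    countsA = Counter(nlp(textA).count_by(ORTH))
    countsB = Counter(nlp(textB).count_by(ORTH))
    # Multiset intersection: the minimum count of each shared token ID.
    intersection = sum((countsA & countsB).values())
    # Multiset symmetric difference: the tokens left over on either side.
    difference = sum(((countsA - countsB) + (countsB - countsA)).values())
    return intersection / float(intersection + difference)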
If you want to compute exactly the same thing as your code above, here's the spaCy equivalent. The Doc object lets you iterate over Token objects. You should then base your counts on the token.orth attribute, which is the integer ID of the string. I expect working with sets of integers will be a bit faster than sets of strings:
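A sketch of that version (again assuming an nlp object from spacy.load; the function name is just illustrative):

def symm_similarity_types(nlp, textA, textB):
    # token.orth is the integer ID of the token's string, so these are
    # sets of ints rather than sets of strings.
    orthsA = set(token.orth for token in nlp(textA))
    orthsB = set(token.orth for token in nlp(textB))
    intersection = len(orthsA.intersection(orthsB))
    difference = len(orthsA.symmetric_difference(orthsB))
    return intersection / float(intersection + difference)

This should be a bit more efficient than the NLTK version, because you're working with sets of integers, not strings.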
If you're really concerned about efficiency, it's often more convenient to just work in Cython, instead of trying to guess what Python is doing. Here's the basic loop:
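A rough sketch of that loop as a Cython function (the cimports assume spaCy's spacy/structs.pxd and spacy/tokens/doc.pxd from a source install, so the exact paths and fields may differ between versions):

from spacy.tokens.doc cimport Doc
from spacy.structs cimport TokenC, LexemeC

def orth_id_set(Doc doc):
    # doc.c is a TokenC* array of length doc.length; token.lex is a
    # const LexemeC*, and lex.orth is the integer ID of the token's string.
    cdef int i
    cdef const TokenC* token
    ids = set()
    for i in range(doc.length):
        token = &doc.c[i]
        ids.add(token.lex.orth)
    return ids

You can then intersect the returned sets of integers in plain Python, exactly as above.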
doc.c is a TokenC*, so you're iterating over contiguous memory and dereferencing a single pointer (token.lex is a const LexemeC*).