Why can't we get consistent results when using spacy to do stemming/lemmatization?

Here is my Python code:

import spacy
nlp = spacy.load('en')
line = u'Algorithms; Deterministic algorithms; Adaptive algorithms; Something...'
line = line.lower()
print(' '.join([token.lemma_ for token in nlp(line)]))

The output is:

algorithm ; deterministic algorithm ; adaptive algorithms ; something...

Why is the third 'algorithms' not converted to 'algorithm'? And when I remove the lower() call, I get this:

algorithms ; deterministic algorithms ; adaptive algorithm ; something...

This time the first and second 'algorithms' are not converted. This is driving me crazy; how can I fix this so that every single word is lemmatized?

There are 2 answers

Answered by syllogism_:

Which version are you using? With .lower() it works correctly for me:

>>> doc = nlp(u'Algorithms; Deterministic algorithms; Adaptive algorithms; Something...'.lower())
>>> for word in doc:
...   print(word.text, word.lemma_, word.tag_)
... 
(u'algorithms', u'algorithm', u'NNS')
(u';', u';', u':')
(u'deterministic', u'deterministic', u'JJ')
(u'algorithms', u'algorithm', u'NNS')
(u';', u';', u':')
(u'adaptive', u'adaptive', u'JJ')
(u'algorithms', u'algorithm', u'NN')
(u';', u';', u':')
(u'something', u'something', u'NN')
(u'...', u'...', u'.')

Without the .lower(), the tagger assigns Algorithms the tag NNP, i.e. a proper noun. This prevents the lemmatisation, because the model has statistically guessed that the word is a proper noun.
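
You can verify this with a quick check along these lines, using the same pipeline as above; it should show the NNP tag on the capitalised token (the exact output depends on your model version):

doc = nlp(u'Algorithms; Deterministic algorithms; Adaptive algorithms; Something...')
# Expect something like: Algorithms NNP algorithms
print(doc[0].text, doc[0].tag_, doc[0].lemma_)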

You can set a special-case rule in the tokenizer to tell spaCy that Algorithms is never a proper noun, if you like.

import spacy
from spacy.attrs import POS, LEMMA, ORTH, TAG
nlp = spacy.load('en')

nlp.tokenizer.add_special_case(u'Algorithms', [{ORTH: u'Algorithms', LEMMA: u'algorithm', TAG: u'NNS', POS: u'NOUN'}])
doc = nlp(u'Algorithms; Deterministic algorithms; Adaptive algorithms; Something...')
for word in doc:
    print(word.text, word.lemma_, word.tag_)

Output:

(u'Algorithms', u'algorithm', u'NNS')
(u';', u';', u':')
(u'Deterministic', u'deterministic', u'JJ')
(u'algorithms', u'algorithm', u'NN')
(u';', u';', u':')
(u'Adaptive', u'adaptive', u'JJ')
(u'algorithms', u'algorithm', u'NNS')
(u';', u';', u':')
(u'Something', u'something', u'NN')
(u'...', u'...', u'.')

The tokenizer.add_special_case function allows you to specify how a string of characters will be tokenized and to set attributes on each of the resulting subtokens.
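
A special case can also split one string into several subtokens and set attributes on each piece. Here is a minimal sketch, assuming the same spaCy 1.x-style API as above; the "don't" rule and its tag values are purely illustrative:

import spacy
from spacy.attrs import ORTH, LEMMA, TAG

nlp = spacy.load('en')

# Split the single string "don't" into two subtokens, each with its own
# lemma and tag.
nlp.tokenizer.add_special_case(u"don't", [
    {ORTH: u"do", LEMMA: u"do", TAG: u"VBP"},
    {ORTH: u"n't", LEMMA: u"not", TAG: u"RB"},
])

for word in nlp(u"I don't know."):
    print(word.text, word.lemma_, word.tag_)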

Answered by Mohammad Yusuf:

I think syllogism_ explained it better, but here's another way, using NLTK's WordNetLemmatizer:

from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()
# Split the input into phrases on ';', then each phrase into words.
line = u'Algorithms; Deterministic algorithms; Adaptive algorithms; Something...'.lower().split(';')
line = [a.strip().split(' ') for a in line]
# Lemmatize every word in every phrase.
line = [[lemma.lemmatize(word) for word in phrase] for phrase in line]
print(line)

Output:

[[u'algorithm'], [u'deterministic', u'algorithm'], [u'adaptive', u'algorithm'], [u'something...']]
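
One caveat with the WordNet approach: lemmatize() treats every word as a noun unless you pass a pos argument, so verbs and adjectives can come through unchanged. A small sketch:

from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()
print(lemma.lemmatize(u'running'))           # 'running' (noun assumed by default)
print(lemma.lemmatize(u'running', pos='v'))  # 'run'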