Here is my Python code:
import spacy

nlp = spacy.load('en')
line = u'Algorithms; Deterministic algorithms; Adaptive algorithms; Something...'
line = line.lower()
print(' '.join([token.lemma_ for token in nlp(line)]))
The output is:
algorithm ; deterministic algorithm ; adaptive algorithms ; something...
Why is the third 'algorithms' not converted to 'algorithm'?
And when I remove the lower() call, I get this:
algorithms ; deterministic algorithms ; adaptive algorithm ; something...
This time the first and second 'algorithms' are not converted.
This is driving me crazy. How can I fix this so that every word gets lemmatized?
What version are you using? With lower() it works correctly for me. Without lower(), the tagger assigns Algorithms the tag NNP, i.e. proper noun. This prevents the lemmatisation, because the model has statistically guessed that the word is a proper noun.
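You can see what the tagger decided by printing each token's tag next to its lemma (a small diagnostic sketch using the standard token.tag_ and token.lemma_ attributes; the 'en' shortcut matches the setup in the question):

import spacy

nlp = spacy.load('en')
line = u'Algorithms; Deterministic algorithms; Adaptive algorithms; Something...'

# Print each token with the tag the statistical model assigned and the
# resulting lemma, to spot which tokens were guessed as proper nouns (NNP).
for token in nlp(line):
    print(token.text, token.tag_, token.lemma_)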
You can set a special-case rule in the tokenizer to tell spaCy that Algorithms is never a proper noun, if you like. The tokenizer.add_special_case function allows you to specify how a string of characters will be tokenized, and to set attributes on each of the subtokens.