Following the examples from the documentation regarding tokenization, I have the following code:
import spacy
from spacy.symbols import ORTH, NORM
nlp = spacy.load("en_core_web_sm")
special_case = [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)
doc = nlp("gimme that. he gave me that. Going to someplace.")
Then I check the tokenization:
doc[0].norm_ # 'give' (as expected)
But the lemmatizer does not return the same output:
lemmatizer = nlp.get_pipe("lemmatizer")
lemmatizer.lemmatize(doc[0]) # ['gim'] (expected ['give'])
On the other hand:
lemmatizer.lemmatize(doc[5]) # ['give']
lemmatizer.lemmatize(doc[9]) # ['go']
What am I doing wrong? How can I fix it? In spaCy, what is the difference between normalized tokens and lemmatized tokens? How can I "teach" the lemmatization of a single token (like the gim token in this example)?
In your code you've customized the tokenizer to handle the special case "gimme", splitting it into "gim" + "me" and normalizing "gim" to "give".
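However, NORM and the lemma are separate attributes set by different mechanisms. NORM is assigned by the tokenizer (via exceptions like your special case) and is mainly used as a feature by the statistical models, while the lemma is assigned later by the lemmatizer pipeline component. The English rule-based lemmatizer in spaCy v3 works from the token's text and its POS tag, not from its norm, so it has never seen the word "gim" and falls back to the surface form. That is why doc[0].norm_ is 'give' while the lemma stays 'gim'.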
Here's how you can achieve consistent lemmatization with your custom tokenization. A minimal sketch, assuming the standard en_core_web_sm pipeline (where attribute_ruler runs before lemmatizer and the lemmatizer's overwrite setting is left at its default of False), is to assign the lemma for "gim" directly via the attribute_ruler:
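import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.load("en_core_web_sm")

# Tokenizer special case: split "gimme" into "gim" + "me",
# normalizing "gim" to "give" (your original code).
special_case = [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

# The attribute_ruler runs before the lemmatizer, and by default the
# lemmatizer does not overwrite lemmas that are already set, so a lemma
# assigned here sticks for every token whose text is exactly "gim".
ruler = nlp.get_pipe("attribute_ruler")
ruler.add(patterns=[[{"TEXT": "gim"}]], attrs={"LEMMA": "give"})

doc = nlp("gimme that. he gave me that. Going to someplace.")
print(doc[0].norm_)   # 'give'
print(doc[0].lemma_)  # 'give'
print(doc[5].lemma_)  # 'give'
print(doc[9].lemma_)  # 'go'
Alternatively, you can teach the rule-based lemmatizer itself by adding an exception to its lookup tables. This assumes the tagger labels "gim" as a verb, since the exceptions table is keyed by part of speech:
lemmatizer = nlp.get_pipe("lemmatizer")
lemmatizer.lookups.get_table("lemma_exc")["verb"]["gim"] = ["give"]
Note one difference between the two approaches: the attribute_ruler changes doc[0].lemma_, but calling lemmatizer.lemmatize(doc[0]) directly will still consult only the lemmatizer's own tables and rules, so only the second approach changes that return value.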