I'm trying to get lemmas (i.e. token.lemma_) for all tokens in a document using spaCy.
CODE:
import spacy

sentence = "I'm looking for all of the lemmas. Please help me find them!"
nlp = spacy.load('en', disable=['parser', 'ner'])
doc = nlp(sentence)
tokens = [token.lemma_ for token in doc]
EXPECTED RESULT:
['look', 'lemma', 'help', 'find']
ACTUAL RESULT:
['-PRON-', 'be', 'look', 'all', 'of', 'the', 'lemma', '.', 'please', 'help', '-PRON-', 'find', '-PRON-', '!']
Am I missing some sort of preprocessing function in spaCy, or do I have to preprocess separately? I want all punctuation and stopwords to be removed ahead of lemmatization.
You can use:
tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
The following parts have been added:
- if not token.is_stop: omit the token if it is a stopword
- and not token.is_punct: omit the token if it is punctuation
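To see the filter in isolation, here is a minimal sketch that wraps it in a helper. The function name content_lemmas is my own, not part of spaCy, and the tokens below are hand-made stand-ins (built with SimpleNamespace) rather than real spaCy output, so the snippet runs without a model installed. It relies only on the three token attributes used above: lemma_, is_stop and is_punct.

```python
from types import SimpleNamespace


def content_lemmas(doc):
    """Return the lemmas of tokens that are neither stopwords nor punctuation.

    `doc` can be a spaCy Doc or any iterable of objects exposing
    `lemma_`, `is_stop` and `is_punct`.
    """
    return [
        token.lemma_
        for token in doc
        if not token.is_stop and not token.is_punct
    ]


# Hypothetical stand-in tokens, NOT real spaCy output: they only mimic
# the attributes the filter reads.
fake_doc = [
    SimpleNamespace(lemma_="look", is_stop=False, is_punct=False),
    SimpleNamespace(lemma_="the", is_stop=True, is_punct=False),
    SimpleNamespace(lemma_="lemma", is_stop=False, is_punct=False),
    SimpleNamespace(lemma_=".", is_stop=False, is_punct=True),
]

print(content_lemmas(fake_doc))  # ['look', 'lemma']
```

With a loaded pipeline you would call it as content_lemmas(nlp(sentence)). Which words count as stopwords depends on the spaCy version and language data, so the exact list you get back for the sentence in the question may differ between installs.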