Am I missing the preprocessing function in spaCy's lemmatization?

Question

579 views Asked by hanreli At 03 October 2020 at 15:24

I'm trying to get lemmas (i.e. token.lemma_) for all tokens in a document using spaCy.

CODE:

import spacy

sentence = "I'm looking for all of the lemmas. Please help me find them!"
# 'en' is the spaCy 2.x shortcut name for the English model
nlp = spacy.load('en', disable=['parser', 'ner'])
doc = nlp(sentence)
tokens = [token.lemma_ for token in doc]

EXPECTED RESULT:

['look', 'lemma', 'help', 'find']

ACTUAL RESULT:

['-PRON-', 'be', 'look', 'all', 'of', 'the', 'lemma', '.', 'please', 'help', '-PRON-', 'find', '-PRON-', '!']

Am I missing some sort of preprocessing function in spaCy, or do I have to preprocess separately? I want all punctuation and stopwords to be removed ahead of lemmatization.

Original Q&A

There is 1 answer

**Wiktor Stribiżew** · Accepted Answer · 2020-10-03T19:30:50+00:00

You can filter the tokens inside the list comprehension:

>>> [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
['look', 'lemma', 'help', 'find']

The following conditions have been added:

if not token.is_stop - keep the token only if it is not a stopword
and - both conditions must hold
not token.is_punct - keep the token only if it is not punctuation
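
As a minimal sketch, the same filter can be wrapped in a small reusable function. The names content_lemmas and fake_token are hypothetical, and the stand-in tokens carry hand-assigned lemma_, is_stop, and is_punct values chosen to mirror the results reported in this thread, so the example runs without a downloaded model. The filter is applied after the pipeline has already assigned lemmas; it only selects which lemmas to keep, so no separate preprocessing pass is involved.

```python
from types import SimpleNamespace


def content_lemmas(doc):
    """Return lemmas for tokens that are neither stop words nor punctuation."""
    return [
        token.lemma_
        for token in doc
        if not token.is_stop and not token.is_punct
    ]


def fake_token(lemma, is_stop=False, is_punct=False):
    """Hypothetical stand-in exposing the three token attributes used above."""
    return SimpleNamespace(lemma_=lemma, is_stop=is_stop, is_punct=is_punct)


# Hand-assigned values (an assumption, not real spaCy output) for:
# "I'm looking for all of the lemmas. Please help me find them!"
fake_doc = [
    fake_token("-PRON-", is_stop=True),   # I
    fake_token("be", is_stop=True),       # 'm
    fake_token("look"),                   # looking
    fake_token("for", is_stop=True),      # for
    fake_token("all", is_stop=True),      # all
    fake_token("of", is_stop=True),       # of
    fake_token("the", is_stop=True),      # the
    fake_token("lemma"),                  # lemmas
    fake_token(".", is_punct=True),       # .
    fake_token("please", is_stop=True),   # Please
    fake_token("help"),                   # help
    fake_token("-PRON-", is_stop=True),   # me
    fake_token("find"),                   # find
    fake_token("-PRON-", is_stop=True),   # them
    fake_token("!", is_punct=True),       # !
]

print(content_lemmas(fake_doc))  # ['look', 'lemma', 'help', 'find']
```

With a real pipeline the call would be content_lemmas(nlp(sentence)), assuming the loaded model's stop-word list flags the same tokens as the stand-ins do.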
