NLTK lemmatizer changing "less" to "le". Text doesn't make sense anymore

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('Less'.lower())

'le'

What's going on here, and how can I avoid this?

The word 'le' is now appearing all over my LDA topic model, and it doesn't make sense.

Who knows what other words it is affecting in the model. Should I avoid using the Lemmatizer or is there a way to fix this?

There is 1 answer

Maciej Skorski On

I will add some context to the observation in the comments. The key is to understand the lemmatization rules, which depend on the part of speech. Your word is treated as a noun (the default), so its supposed plural suffix "s" gets stripped repeatedly until the result is found in WordNet: less -> les -> le. The same kind of stripping happens with mes, a misspelling of the noun mess.

from nltk.stem import WordNetLemmatizer
word = 'mes'
wnl = WordNetLemmatizer()
wnl.lemmatize(word) # me

In your case, the fix is to pass the part of speech explicitly, here 'a' for adjective (as suggested in the comments):

word = 'less'
wnl = WordNetLemmatizer()
wnl.lemmatize(word, 'a') # less
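
Hard-coding 'a' only helps for this one word. For a whole corpus, one common approach is to take the part of speech from a tagger and translate it to the letters the lemmatizer expects. The following is a minimal sketch of that idea, not a definitive implementation: the helper names penn_to_wordnet and lemmatize_tagged are hypothetical, and it assumes the tags follow the Penn Treebank convention (as produced by nltk.pos_tag).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'JJR') to a WordNet POS letter."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # everything else falls back to noun, the lemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs using any lemmatize(word, pos) callable."""
    return [lemmatize(word.lower(), penn_to_wordnet(tag))
            for word, tag in tagged_tokens]


# Assumed usage with NLTK (needs the tagger and WordNet data downloaded):
#   import nltk
#   from nltk.stem import WordNetLemmatizer
#   tagged = nltk.pos_tag(tokens)
#   lemmas = lemmatize_tagged(tagged, WordNetLemmatizer().lemmatize)
```

Passing the lemmatizer in as a callable keeps the mapping logic independent of NLTK, so it can be checked without the corpus data. The result still depends on how the tagger labels each token, so it is worth spot-checking frequent words before feeding the output to LDA.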

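For more detail, these are the suffix substitution rules for each part of speech ('n' noun, 'v' verb, 'a' adjective, 'r' adverb, 's' satellite adjective):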

from nltk.corpus.reader import WordNetCorpusReader
WordNetCorpusReader.MORPHOLOGICAL_SUBSTITUTIONS
{'n': [('s', ''),
  ('ses', 's'),
  ('ves', 'f'),
  ('xes', 'x'),
  ('zes', 'z'),
  ('ches', 'ch'),
  ('shes', 'sh'),
  ('men', 'man'),
  ('ies', 'y')],
 'v': [('s', ''),
  ('ies', 'y'),
  ('es', 'e'),
  ('es', ''),
  ('ed', 'e'),
  ('ed', ''),
  ('ing', 'e'),
  ('ing', '')],
 'a': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')],
 'r': [],
 's': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')]}

For the whole algorithm, see the source code of WordNetLemmatizer.lemmatize.
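
To see why "less" ends up as "le", the repeated rule application can be imitated in a few lines. This is a simplified sketch, not NLTK's actual code: it ignores the exception lists, and TOY_NOUNS is a made-up stand-in for WordNet's noun entries, chosen only so the example runs without the corpus.

```python
# Noun rules copied from MORPHOLOGICAL_SUBSTITUTIONS['n'] above.
NOUN_RULES = [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
              ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
              ('men', 'man'), ('ies', 'y')]

# Toy lexicon (an assumption for illustration, not real WordNet data).
TOY_NOUNS = {'le', 'me', 'mess', 'dog'}


def apply_rules(forms, rules):
    """Apply every matching suffix substitution once to each form."""
    return [form[:-len(old)] + new
            for form in forms
            for old, new in rules
            if form.endswith(old)]


def toy_lemmatize(word, rules=NOUN_RULES, lexicon=TOY_NOUNS):
    """Keep rewriting suffixes until some candidate is in the lexicon."""
    if word in lexicon:
        return word
    forms = apply_rules([word], rules)
    while forms:
        matches = [form for form in forms if form in lexicon]
        if matches:
            return min(matches, key=len)
        forms = apply_rules(forms, rules)
    return word  # nothing matched: return the input unchanged


print(toy_lemmatize('less'))  # le   (less -> les -> le)
print(toy_lemmatize('mes'))   # me
print(toy_lemmatize('mess'))  # mess (already in the toy lexicon)
```

Because the loop only stops when a candidate is found in the lexicon, a word that is missing under the assumed part of speech can be cut down to an unrelated short entry.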