Is there any way to prevent my WordNetLemmatizer from lemmatizing contracted words like "can't" or "didn't"?


The code below is what I currently have. It works fine, except that words like "didn't" end up as "didn" and "t". I would like it either to drop the apostrophe, so the word comes out as "didnt", or to leave it as "didn't", though I suspect the latter may cause issues later with TfidfVectorizer.

Is there any way to implement this without too much of a hassle?

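from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(word):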
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

def lemmatize_review(review):
    """Lemmatize single review string"""
    lemmatized_review = ' '.join([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in word_tokenize(review)])
    return lemmatized_review

review_data['Lemmatized_Review'] = review_data['Review'].apply(lemmatize_review)

There are 2 answers

Answer by qaiser:

You can use TweetTokenizer instead of word_tokenize. It keeps contractions together as single tokens:

from nltk.tokenize import TweetTokenizer

# Avoid naming the variable `str`, which would shadow the built-in type.
text = "didn't can't won't how are you"
tokenizer = TweetTokenizer()

tokenizer.tokenize(text)
# Output:
# ["didn't", "can't", "won't", 'how', 'are', 'you']
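
To connect this to the function in the question, here is a minimal sketch of how TweetTokenizer might replace word_tokenize inside lemmatize_review. The `lemmatize_token` parameter is a hypothetical hook introduced for illustration (it is not part of NLTK): pass any callable that maps one token to its lemma.

```python
from nltk.tokenize import TweetTokenizer

# TweetTokenizer is rule-based, so it needs no extra NLTK data downloads.
_tokenizer = TweetTokenizer()

def lemmatize_review(review, lemmatize_token):
    """Lemmatize one review string without splitting contractions.

    `lemmatize_token` is a hypothetical hook: any callable that takes a
    token and returns its lemma.
    """
    tokens = _tokenizer.tokenize(review)
    return ' '.join(lemmatize_token(token) for token in tokens)

# With the objects from the question, the call would look roughly like:
# review_data['Lemmatized_Review'] = review_data['Review'].apply(
#     lambda r: lemmatize_review(
#         r, lambda w: lemmatizer.lemmatize(w, get_wordnet_pos(w))))
```

A contraction such as "didn't" has no WordNet entry, so the lemmatizer should most likely return it unchanged. Keep in mind that TfidfVectorizer's default token_pattern, r"(?u)\b\w\w+\b", would split "didn't" at the apostrophe again, so if you keep apostrophes you will probably need to pass your own token_pattern or tokenizer there.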
Answer by Kedaar Rao:

You can simply replace the "'" character with an empty string "" before lemmatizing, as shown below:

>>> word = "didn't can't won't"
>>> word
"didn't can't won't"
>>> x = word.replace("'", "")
>>> x
'didnt cant wont'
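
The same idea can be wrapped in a small helper so it runs on each review before tokenization. This is only a sketch: the name `strip_apostrophes` is made up for illustration, and handling the typographic apostrophe (U+2019) is an assumption about what real review text might contain.

```python
import re

# Matches the ASCII apostrophe and the typographic one (U+2019).
_APOSTROPHES = re.compile("['\u2019]")

def strip_apostrophes(text):
    """Remove apostrophes so contractions survive tokenization as one word."""
    return _APOSTROPHES.sub("", text)

print(strip_apostrophes("didn't can\u2019t won't"))  # prints: didnt cant wont
```

One trade-off to be aware of: removing apostrophes can merge distinct words, for example "we'll" becomes "well" and "she'll" becomes "shell", which may or may not matter for your features.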