The code below is what I currently have. It works fine, but it splits words like "didn't" into "didn" and "t". I would like it to either remove the apostrophe, so the word comes out as "didnt", or leave it as "didn't", though that may cause issues later with TfidfVectorizer.
Is there any way to implement this without too much of a hassle?
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Fall back to NOUN for any tag WordNet does not cover
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

def lemmatize_review(review):
    """Lemmatize single review string"""
    lemmatized_review = ' '.join([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in word_tokenize(review)])
    return lemmatized_review

# review_data is a pandas DataFrame with a 'Review' column
review_data['Lemmatized_Review'] = review_data['Review'].apply(lemmatize_review)
You can use NLTK's TweetTokenizer instead of word_tokenize. It keeps contractions such as "didn't" together as a single token.
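As a rough sketch of that suggestion, something like the following should work. The helper name tokenize_review and the strip_apostrophes flag are made up for illustration and are not part of NLTK; only TweetTokenizer and its tokenize() method come from the library.

```python
from nltk.tokenize import TweetTokenizer

# TweetTokenizer is regex-based, so it needs no extra NLTK data downloads.
tweet_tokenizer = TweetTokenizer()

def tokenize_review(review, strip_apostrophes=False):
    """Tokenize a review, keeping contractions such as "didn't" as one token.

    Hypothetical helper: strip_apostrophes is an illustrative option,
    not an NLTK parameter.
    """
    tokens = tweet_tokenizer.tokenize(review)
    if strip_apostrophes:
        # Optional: turn "didn't" into "didnt" (straight apostrophes only).
        tokens = [t.replace("'", "") for t in tokens]
        # Drop any token that consisted of nothing but an apostrophe.
        tokens = [t for t in tokens if t]
    return tokens

print(tokenize_review("I didn't like it"))
print(tokenize_review("I didn't like it", strip_apostrophes=True))
```

In lemmatize_review you would then swap word_tokenize(review) for tokenize_review(review). Regarding the TfidfVectorizer concern: its default token_pattern treats punctuation as a separator and ignores single-character tokens, so "didn't" would be cut down to "didn" there. Stripping the apostrophe first, or passing your own tokenizer or token_pattern to the vectorizer, should avoid that.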