I want to apply POS tagging and lemmatisation to my text data. I found this example code on Kaggle. It works on a single sentence, but I want to modify it so that it can be applied to a column of a dataframe.
#Kaggle example code:
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
def penn2morphy(penntag):
    """ Converts Penn Treebank tags to WordNet. """
    morphy_tag = {'NN':'n', 'JJ':'a',
                  'VB':'v', 'RB':'r'}
    try:
        return morphy_tag[penntag[:2]]
    except:
        return 'n' # if mapping isn't found, fall back to Noun.
# `pos_tag` takes the tokenized sentence as input, i.e. a list of strings,
# and returns a list of (word, tag) tuples,
# so we need to get the tag from the 2nd element of each tuple.
walking_tagged = pos_tag(word_tokenize('He is walking to school'))
#print(walking_tagged)
testing["text"].apply(penn2morphy)
#[wnl.lemmatize(word.lower(), pos=penn2morphy(tag)) for word, tag in walking_tagged]
I presumed you would just use the apply function, but that doesn't work. The first line, where `pos_tag` is being applied, just labels each row as n, so I presume it isn't iterating through each row.
#Example data
import pandas as pd
r1 = ["he, has, a, glass, of, water, together, with, a, mirror"],"Pass"
r2 = ["lamp, lens, right, left"], "Fail"
r3 = ["candle, clock, vase, spoon"], "Fail"
d=(r1,r2,r3)
ex_df = pd.DataFrame(d, columns=["col1", "col2"])
walking_tagged2 = ex_df["col1"].apply(pos_tag)
[wnl.lemmatize(word.lower(), pos=penn2morphy(tag)) for word, tag in walking_tagged2]
Any ideas? Thank you
Take a look at https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L124
After you have a `lemmatize_sentence` function like the one linked above, you can call it on a single sentence or apply it row by row.
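A minimal sketch of such a function, assuming NLTK is installed and its tokenizer, tagger and WordNet data have already been fetched with `nltk.download`. It reuses `penn2morphy` from the question. The body is an illustration, not a copy of the pywsd implementation:

```python
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()


def penn2morphy(penntag):
    """Map a Penn Treebank tag to a WordNet POS tag, falling back to noun."""
    morphy_tag = {'NN': 'n', 'JJ': 'a', 'VB': 'v', 'RB': 'r'}
    return morphy_tag.get(penntag[:2], 'n')


def lemmatize_sentence(text):
    """Tokenize, POS-tag and lemmatize one string; returns a list of lemmas."""
    # pos_tag expects a list of tokens and returns (word, tag) tuples.
    tagged = pos_tag(word_tokenize(text))
    return [wnl.lemmatize(word.lower(), pos=penn2morphy(tag))
            for word, tag in tagged]
```

Calling `lemmatize_sentence('He is walking to school')` returns a list of lowercased lemmas, one per token.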
With dataframes:
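A rough sketch of the row-wise step, using the example data from the question. `lemmatize_column` is a hypothetical helper, not part of pandas or pywsd, and `placeholder_lemmatizer` is only a stand-in so the snippet runs without NLTK data. In practice you would pass your own `lemmatize_sentence` (or, assuming pywsd is installed, the one from `pywsd.utils`):

```python
import pandas as pd


def lemmatize_column(df, column, lemmatize_sentence):
    """Apply a sentence-level lemmatizer to every row of df[column]."""
    def unwrap(cell):
        # In the example data each cell is a list holding one string,
        # so pull the string out before handing it to the lemmatizer.
        return cell[0] if isinstance(cell, (list, tuple)) else cell

    # Series.apply calls the function once per cell, not once per word.
    return df[column].apply(lambda cell: lemmatize_sentence(unwrap(cell)))


def placeholder_lemmatizer(text):
    # Stand-in for a real lemmatizer (assumption): lowercase and split only.
    return text.lower().split(", ")


r1 = ["he, has, a, glass, of, water, together, with, a, mirror"], "Pass"
r2 = ["lamp, lens, right, left"], "Fail"
r3 = ["candle, clock, vase, spoon"], "Fail"
ex_df = pd.DataFrame([r1, r2, r3], columns=["col1", "col2"])

ex_df["lemmas"] = lemmatize_column(ex_df, "col1", placeholder_lemmatizer)
print(ex_df[["col1", "lemmas"]])
```

This is also why `apply(penn2morphy)` labels every row as n: `apply` passes the whole cell to `penn2morphy`, which treats the cell as if it were a tag, finds no mapping and falls back to noun. Tokenizing and tagging have to happen inside the function that `apply` calls.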