I am using named capturing groups, (?P<name>...), with a list of verbs and word stems related to the coronavirus pandemic.
import regex
import pandas as pd

data = {
    'id': [1, 2, 3, 4, 5],
    'text': [
        'The pandemy is spreading',
        'He is fighting Covid-19',
        'The pandemic virus spreads',
        'This sentence is about a different topic',
        'How do we stop the virus ?',
    ],
}
df = pd.DataFrame(data)

def covid_lang(text):
    predicates = ['avoid', 'contain', 'track', 'spread', 'contact', 'stop', 'combat', 'fight']
    subjects = ['Corona', 'corona', 'Covid-19', 'epidem', 'infect', 'virus', 'pandem', 'disease', 'outbreak']
    # Predicate before subject
    p1 = fr'(?<=\b(?P<predicate>{"|".join(predicates)}))[^\.]*(?P<subject>{"|".join(subjects)}[a-z]*)'
    result = []
    for m in regex.finditer(p1, text, regex.S):
        result.append([m.group('predicate'), m.group('subject')])
    # Subject before predicate
    p2 = fr'\b(?P<subject>{"|".join(subjects)})[^\.]*(?<=\b(?P<predicate>{"|".join(predicates)}))'
    for m in regex.finditer(p2, text, regex.S):
        result.append([m.group('subject'), m.group('predicate')])
    return result

df['result'] = df['text'].apply(covid_lang)
When there is a match, I would like to return the whole word as the subject, not only its stem (i.e. 'pandemic' and 'pandemy' instead of 'pandem'). I have tried adding [a-z]* right after the list of words, so that the capturing group extends to the end of the word, but it does not change anything.
Also, is it possible to combine the two queries (predicate before subject, subject before predicate) into a single query? I have tried (p1)|(p2), but it did not work with named capturing groups.
Lastly, is it possible to match both uppercase and lowercase spellings, such as Corona and corona, with a single entry in the list?
This should do all three:
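A minimal sketch of one way to get there, under a few assumptions that are mine rather than the question's: it uses the standard library re module instead of regex, replaces the lookbehind with a lazy [^.]*? between the two words, and uses the illustrative group names pred1/subj1 and subj2/pred2, because re rejects a group name that is defined twice in one pattern. The [a-z]* in the question had no visible effect because alternation has the lowest precedence, so the quantifier was attached only to the last stem ('outbreak'); wrapping the stems in (?:...) makes it apply to whichever stem matched. Case is handled with re.IGNORECASE, so 'Corona' no longer needs its own entry.

```python
import re

PREDICATES = ['avoid', 'contain', 'track', 'spread', 'contact', 'stop', 'combat', 'fight']
SUBJECTS = ['corona', 'covid-19', 'epidem', 'infect', 'virus', 'pandem', 'disease', 'outbreak']

# Escape each entry so characters such as '-' are always taken literally.
PRED = '|'.join(map(re.escape, PREDICATES))
SUBJ = '|'.join(map(re.escape, SUBJECTS))

# One pattern with two branches:
#   branch 1: predicate ... subject
#   branch 2: subject ... predicate
# (?:stems)[a-z]* extends the matched stem to the end of the word.
PATTERN = re.compile(
    rf'\b(?P<pred1>{PRED})[^.]*?\b(?P<subj1>(?:{SUBJ})[a-z]*)'
    rf'|\b(?P<subj2>(?:{SUBJ})[a-z]*)[^.]*?\b(?P<pred2>{PRED})',
    re.IGNORECASE,  # 'Corona' and 'corona' both match the stem 'corona'
)

def covid_lang(text):
    """Return [predicate, subject] pairs, always predicate first."""
    return [
        # Only one branch participates per match; the other groups are None.
        [m.group('pred1') or m.group('pred2'),
         m.group('subj1') or m.group('subj2')]
        for m in PATTERN.finditer(text)
    ]

texts = [
    'The pandemy is spreading',
    'He is fighting Covid-19',
    'The pandemic virus spreads',
    'This sentence is about a different topic',
    'How do we stop the virus ?',
]
for t in texts:
    print(covid_lang(t))
```

With pandas, the same function should drop into the question's df['text'].apply(covid_lang) call unchanged. Note that the predicate is still returned as its stem ('fight', not 'fighting'), since the question only asks for the whole word on the subject side.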
Output:
But I'm not sure whether you always want the predicate output first. If not, this should do it:
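A sketch of that variant, with the same assumptions as before (standard library re instead of regex, a lazy [^.]*? instead of the lookbehind, and the illustrative per-branch group names pred1/subj1 and subj2/pred2). The only difference is that each pair is returned in the order the two words occur in the text, which can be read off from which branch of the alternation matched.

```python
import re

PREDICATES = ['avoid', 'contain', 'track', 'spread', 'contact', 'stop', 'combat', 'fight']
SUBJECTS = ['corona', 'covid-19', 'epidem', 'infect', 'virus', 'pandem', 'disease', 'outbreak']

PRED = '|'.join(map(re.escape, PREDICATES))
SUBJ = '|'.join(map(re.escape, SUBJECTS))

PATTERN = re.compile(
    rf'\b(?P<pred1>{PRED})[^.]*?\b(?P<subj1>(?:{SUBJ})[a-z]*)'
    rf'|\b(?P<subj2>(?:{SUBJ})[a-z]*)[^.]*?\b(?P<pred2>{PRED})',
    re.IGNORECASE,
)

def covid_lang_ordered(text):
    """Return pairs in order of appearance: whichever word came first leads."""
    result = []
    for m in PATTERN.finditer(text):
        if m.group('pred1') is not None:
            # Branch 1 matched: the predicate precedes the subject.
            result.append([m.group('pred1'), m.group('subj1')])
        else:
            # Branch 2 matched: the subject precedes the predicate.
            result.append([m.group('subj2'), m.group('pred2')])
    return result

print(covid_lang_ordered('The pandemy is spreading'))
print(covid_lang_ordered('How do we stop the virus ?'))
```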
Output: