I have a large dataset, all_transcripts, with almost 3 million rows. One of the columns, msgText, contains written messages:

```
>>> all_transcripts['msgText']
['this is my first message']
['second message is here']
['this is my third message']
```
Furthermore, I have a list of 200+ words, called gemeentes:

```
>>> gemeentes
['first', 'second', 'third', ...]
```
If a word from this list is contained in msgText, I want to replace it with another word. To do so, I created the function:

```python
def replaceCity(text):
    newText = text.replace(plaatsnaam, 'woonplaats')
    return str(newText)
```
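Note that replaceCity does not take the word to replace as an argument; it reads plaatsnaam from the enclosing scope, so each call substitutes only that one word. A minimal standalone run (with plaatsnaam hardcoded here purely for illustration) behaves like this:

```python
# In the real code, plaatsnaam is set by the loop over gemeentes;
# it is hardcoded here only to show the function's behavior.
plaatsnaam = 'first'

def replaceCity(text):
    newText = text.replace(plaatsnaam, 'woonplaats')
    return str(newText)

print(replaceCity('this is my first message'))  # this is my woonplaats message
```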
So, my desired output would look like:
['this is my woonplaats message'] ['woonplaats message is here'] ['this is my woonplaats message']
Currently, I am looping through the list and, for every item, applying the function to the whole column (replaceCity picks up the loop variable plaatsnaam as a global; the replacements have to accumulate in filtered_text, otherwise each iteration would overwrite the previous one):

```python
all_transcripts['filtered_text'] = all_transcripts['msgText']
for plaatsnaam in gemeentes:
    all_transcripts['filtered_text'] = all_transcripts['filtered_text'].apply(replaceCity)
```
However, this takes a very long time, so it does not seem efficient. Is there a faster way to perform this task?
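For reference, one common faster pattern is to combine all the words into a single alternation regex and do one vectorized pass over the column, instead of 200+ full-column passes. This is only a sketch, assuming gemeentes contains plain words (the sample data below stands in for the real 3-million-row frame):

```python
import re
import pandas as pd

# Small sample standing in for the real data (names taken from the question).
all_transcripts = pd.DataFrame({'msgText': [
    'this is my first message',
    'second message is here',
    'this is my third message',
]})
gemeentes = ['first', 'second', 'third']

# One alternation pattern matching any of the words; re.escape guards
# against entries that contain regex metacharacters.
pattern = re.compile('|'.join(re.escape(g) for g in gemeentes))

# A single vectorized pass over the column replaces every match.
all_transcripts['filtered_text'] = all_transcripts['msgText'].str.replace(
    pattern, 'woonplaats', regex=True)

print(all_transcripts['filtered_text'].tolist())
# → ['this is my woonplaats message', 'woonplaats message is here',
#    'this is my woonplaats message']
```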
This post (Algorithm to find multiple string matches) is similar; however, my problem is different because:

- there, there is only one big piece of text, while I have a dataset with many different rows
- I want to replace the words, rather than just find them.