I have a large dataset all_transcripts with almost 3 million rows. One of the columns msgText contains written messages.

>>> all_transcripts['msgText']

['this is my first message']
['second message is here']
['this is my third message']

Furthermore, I have a list with 200+ words, called gemeentes.

>>> gemeentes
['first','second','third' ... ]

If a word in this list is contained in msgText, I want to replace it by another word. To do so, I created the function:

def replaceCity(text):
    newText = text.replace(plaatsnaam, 'woonplaats')
    return str(newText)

So, my desired output would look like:

['this is my woonplaats message']
['woonplaats message is here']
['this is my woonplaats message']

Currently, I am looping through the list and for every item in my list, apply the replaceCityfunction.

for plaatsnaam in gemeentes:
    global(plaatsnaam)
    all_transcripts['filtered_text'] = test.msgText.apply(replaceCity)

However, this takes very long, so does not seem to be efficient. Is there a faster way to perform this task?


This post (Algorithm to find multiple string matches) is similar, however my problem is different because:

  • here there is only one big piece small of text, while I have a dataset with many different rows

  • I want to replace words, rather than just finding the words.

1 Answers

1
Dan On Best Solutions

Assuming all_transcripts is a pandas DataFrame:

all_transcripts['msgText'].str.replace('|'.join(gemeentes),'woonplaats')

Example:

all_transcripts = pd.DataFrame([['this is my first message'],
                                ['second message is here'],
                                ['this is my third message']],
                               columns=['msgText'])
gemeentes = ['first','second','third']

all_transcripts['msgText'].str.replace('|'.join(gemeentes),'woonplaats')

outputs

0    this is my woonplaats message
1       woonplaats message is here
2    this is my woonplaats message