How to optimize nested loops in Python 3?


Code so far:

import glob
import re

words = [x.strip() for x in open('words.txt').read().split('\n') if x]
paths = glob.glob('./**/*.text', recursive=True)

for path in paths:
    with open(path, "r+") as file:
        s = file.read()
        for word in words:
            s = re.sub(word, 'random_text', s)
            file.seek(0)
            file.write(s)
            file.truncate()

I need to loop through file paths, scan each file for words, and replace each word found with some text. Just to be clear, this code works; it's just very slow (it takes well over an hour), since there are around 23k words and 14k files. Could you please give me recommendations for speeding it up?

I've looked at the map() and zip() functions, but I don't think they're what I need (I could be wrong). I've also looked at threading and multiprocessing, but I'm not sure how to implement them in this case. I've tried doing this in bash with 'sed' too, but that also takes very long and hits the same nested-loop problem. Thanks in advance for the help! (I'm pretty new to coding, so go easy on me! :))


There are 2 answers

Jacinator

I think you can remove the inner for-loop entirely by joining all the words into one alternation pattern and pre-compiling it, so the regex is built once instead of being recompiled for every word on every file. I'm far from experienced at optimizing code, but this is where I would start.

import glob
import re

words = [x.strip() for x in open('words.txt').read().split('\n') if x]
paths = glob.glob('./**/*.text', recursive=True)

# build one combined pattern and compile it once; if any word contains
# regex metacharacters, wrap each word in re.escape() before joining
regex = re.compile('|'.join(words))

for path in paths:
    with open(path, 'r+') as file:
        contents = file.read()
        contents = regex.sub('random_text', contents)
        # overwrite the file in place with the substituted text
        file.seek(0)
        file.write(contents)
        file.truncate()

This has limited applicability: if you want to change 'random_text' depending on which word you're replacing, this won't work.
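If you did need a different replacement per word, one way around that (a minimal sketch, assuming a plain dict from word to replacement; the names and file path here are made up for illustration) is to pass a function to regex.sub, which re calls with the match object for every hit:

import re

# hypothetical mapping of words to their replacements
replacements = {'foo': 'FOO_REPLACED', 'bar': 'BAR_REPLACED'}

# re.escape guards against words that contain regex metacharacters
regex = re.compile('|'.join(re.escape(word) for word in replacements))

def pick_replacement(match):
    # look up the replacement for whichever word actually matched
    return replacements[match.group(0)]

with open('example.text', 'r+') as file:  # hypothetical file path
    contents = regex.sub(pick_replacement, file.read())
    file.seek(0)
    file.write(contents)
    file.truncate()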

Vahagn Tumanyan

In addition to @Jacinator's great answer, using multiple processes will actually reduce your runtime, since each file can be processed independently.

import glob
import re

from concurrent.futures import ProcessPoolExecutor

words = [x.strip() for x in open('words.txt').read().split('\n') if x]
regex = re.compile('|'.join(words))

def replace_in_one_file(path):
    # each worker process opens, rewrites, and truncates one file
    with open(path, 'r+') as file:
        contents = file.read()
        contents = regex.sub('random_text', contents)
        file.seek(0)
        file.write(contents)
        file.truncate()

if __name__ == '__main__':
    paths = glob.glob('./**/*.text', recursive=True)

    with ProcessPoolExecutor(max_workers=10) as executor:
        executor.map(replace_in_one_file, paths)
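With thousands of small files, the overhead of dispatching each path to a worker one at a time can add up; executor.map accepts a chunksize argument that sends paths to workers in batches. A sketch of the same call with batching (the value 100 is just a starting point to tune, not a measured recommendation):

    # batch 100 paths per task to cut inter-process overhead
    with ProcessPoolExecutor(max_workers=10) as executor:
        executor.map(replace_in_one_file, paths, chunksize=100)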