Code so far:
import glob
import re

words = [x.strip() for x in open('words.txt').read().split('\n') if x]
paths = glob.glob('./**/*.text', recursive=True)

for path in paths:
    with open(path, "r+") as file:
        s = file.read()
        for word in words:
            s = re.sub(word, 'random_text', s)
        file.seek(0)
        file.write(s)
        file.truncate()
I need to loop through file paths, scan each file for the words, and replace each word found with some text. Just to be clear, this code works; it's just very slow (it takes well over an hour), as there are around 23k words and 14k files. Could you please give me recommendations for speeding it up?

I've looked at the map() and zip() functions, but I don't think they're what I need (I could be wrong). I've also looked at threading and multiprocessing, but I'm not sure how to implement them in this case. I've tried doing this in bash with sed too, but that also takes very long and hits the same nested-loop problem. Thanks in advance for the help! (I'm pretty new to coding, so go easy on me! :))
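For reference, since every file is processed independently, one way to apply multiprocessing here is to hand the list of paths to a pool of worker processes. This is only a minimal sketch, assuming the same words.txt and ./**/*.text layout as the code above:

import glob
import multiprocessing
import re

# Read the word list once; each worker process gets its own copy.
with open('words.txt') as f:
    WORDS = [x.strip() for x in f.read().split('\n') if x]

def process_file(path):
    # Same per-file logic as the original code, run in a worker process.
    with open(path, 'r+') as file:
        s = file.read()
        for word in WORDS:
            s = re.sub(word, 'random_text', s)
        file.seek(0)
        file.write(s)
        file.truncate()

if __name__ == '__main__':
    paths = glob.glob('./**/*.text', recursive=True)
    with multiprocessing.Pool() as pool:
        pool.map(process_file, paths)

The speedup from this alone is roughly bounded by the number of CPU cores; the bigger win usually comes from cutting down the per-file work itself, as suggested below.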
I think you can remove the inner for-loop and avoid recompiling the regex for every word by combining all the words into a single pre-compiled pattern. I'm far from experienced at optimizing code, but this is where I would start.
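A sketch of that idea, assuming the entries in words.txt are meant as literal strings rather than regex patterns (hence re.escape): join them into one alternation, compile it once, and call sub() a single time per file.

import glob
import re

with open('words.txt') as f:
    words = [x.strip() for x in f.read().split('\n') if x]

# Build one alternation pattern from all words and compile it once.
# Sorting longer words first stops a shorter word from shadowing a
# longer one in the alternation (e.g. 'cat' matching before 'cats').
pattern = re.compile('|'.join(re.escape(w) for w in sorted(words, key=len, reverse=True)))

for path in glob.glob('./**/*.text', recursive=True):
    with open(path, 'r+') as file:
        s = file.read()
        s = pattern.sub('random_text', s)  # single pass over the file
        file.seek(0)
        file.write(s)
        file.truncate()

This replaces the 23k-iteration inner loop with one pass over each file's text, which is where most of the time is likely going.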
This has limited applicability: if you want to change 'random_text' depending on which word you're replacing, this won't work.
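It can still be adapted for that case, though: re.sub accepts a function as the replacement, so the matched word can be looked up in a mapping. A small sketch, with replacements as a hypothetical word-to-text dict:

import re

# Hypothetical mapping from each word to its own replacement text.
replacements = {'foo': 'bar', 'spam': 'eggs'}

pattern = re.compile('|'.join(re.escape(word) for word in replacements))

def substitute(match):
    # Look up the matched word to decide what to replace it with.
    return replacements[match.group(0)]

s = pattern.sub(substitute, 'foo and spam')  # -> 'bar and eggs'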