I'm trying to remove a list of punctuation characters from my text file, but I have one problem with words separated by a hyphen. For example, if I have the word "post-trauma" I get "posttrauma", whereas I want to get "post" "trauma".
My code is:
import re

punct = ['!', '#', '"', '%', '$', '&', ')', '(', '+', '*', '-']
with open(myFile, "r") as f:
    text = f.read()
remove = '|'.join(REMOVE_LIST)  # list of words to remove
regex = re.compile(r'(' + remove + r')', flags=re.IGNORECASE)
out = regex.sub("", text)
delta = " ".join(out.split())  # collapse runs of whitespace
txt = "".join(c for c in delta if c not in punct)
Is there a way to solve it?
I believe you can just call the built-in replace function on delta in your last line, turning each hyphen into a space (delta.replace("-", " ")) before the remaining punctuation is filtered out. This means all the hyphens in your text will become spaces, so the hyphenated parts will be treated as separate words.
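A minimal sketch of that change, assuming delta is the whitespace-normalized string from the question. The helper name strip_punct and the sample input are only for illustration and are not part of the original code:

```python
punct = ['!', '#', '"', '%', '$', '&', ')', '(', '+', '*', '-']

def strip_punct(delta):
    # Replace hyphens with spaces first, so "post-trauma" becomes two words,
    # then drop every remaining character that appears in punct.
    spaced = delta.replace("-", " ")
    return "".join(c for c in spaced if c not in punct)

print(strip_punct("post-trauma (acute)!"))  # post trauma acute
```

Because the hyphen is replaced before the filter runs, leaving '-' in punct is harmless. If the input can contain a hyphen surrounded by spaces, running " ".join(txt.split()) afterwards would collapse any doubled spaces.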