Python remove punctuation from a text file

I'm trying to remove a list of punctuation characters from my text file, but I have one problem with words separated by a hyphen. For example, for the word "post-trauma" I get "posttrauma", whereas I want to get "post" "trauma".

My code is:

 import re

 punct = ['!', '#', '"', '%', '$', '&', ')', '(', '+', '*', '-']

 with open(myFile, "r") as f:
     text = f.read()  # read the whole file (this step was missing from the snippet)

 remove = '|'.join(REMOVE_LIST)  # REMOVE_LIST: list of words to remove
 regex = re.compile(r'(' + remove + r')', flags=re.IGNORECASE)
 out = regex.sub("", text)

 delta = " ".join(out.split())  # collapse runs of whitespace
 txt = "".join(c for c in delta if c not in punct)

Is there a way to solve it?


There are 2 answers

Andrew Dean (best answer)

I believe you can just call the string's built-in replace method on delta, so your last line would become the following:

 txt = "".join(c for c in delta.replace("-", " ") if c not in punct)

This means all the hyphens in your text will become spaces, so the words will be treated as if they were separate.
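
As a quick check of this suggestion, here is a minimal sketch. The helper name `strip_punct` and the sample sentence are made up for illustration; only `punct` and the `replace` call come from the thread. Because `replace` runs before the filter, the hyphen is already a space by the time `punct` is consulted.

```python
# Punctuation list from the question, including the hyphen.
punct = ['!', '#', '"', '%', '$', '&', ')', '(', '+', '*', '-']

def strip_punct(text):
    # Hypothetical helper: turn hyphens into spaces first, then drop
    # every remaining character that appears in punct.
    return "".join(c for c in text.replace("-", " ") if c not in punct)

sample = 'post-trauma (stress) is 100% real!'  # made-up sample text
result = strip_punct(sample)
print(result)
```

Calling `.split()` on the result then yields "post" and "trauma" as separate words.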

Azeros

The above method works only because replace() runs before the filter, so the hyphens have already become spaces by the time punct is checked. Keeping the dash ("-") in punct is therefore redundant and hides the intent, so I would remove it from the list. The updated code looks like this:

 import re

 punct = ['!', '#', '"', '%', '$', '&', ')', '(', '+', '*']  # no '-' here

 with open(myFile, "r") as f:
     text = f.read()  # read the whole file (this step was missing from the snippet)

 remove = '|'.join(REMOVE_LIST)  # REMOVE_LIST: list of words to remove
 regex = re.compile(r'(' + remove + r')', flags=re.IGNORECASE)
 out = regex.sub("", text)

 delta = " ".join(out.split())  # collapse runs of whitespace
 txt = "".join(c for c in delta.replace("-", " ") if c not in punct)

The problem comes from the fact that you are replacing all the characters in punct with an empty string, and you want a space for the "-". Thus, you need to replace the characters twice (once with empty string, and once with a space).
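
The two replacements can also be done in a single pass. The following is a sketch of an alternative, not something proposed in the thread: it uses `str.maketrans` and `str.translate`, and the names `table` and `clean` are hypothetical. Characters mapped to `None` are deleted, while the hyphen is mapped to a space.

```python
# Characters to delete outright (the hyphen is handled separately).
punct = ['!', '#', '"', '%', '$', '&', ')', '(', '+', '*']

# One translation table: hyphen -> space, every character in punct -> removed.
table = str.maketrans({"-": " ", **{ch: None for ch in punct}})

def clean(text):
    # Hypothetical helper: apply both replacements in one pass,
    # then collapse any repeated whitespace left behind.
    return " ".join(text.translate(table).split())

print(clean('post-trauma & "stress" (PTSD)!'))
```

Collapsing whitespace after the translation, rather than before it, avoids double spaces where a removed character sat between two spaces.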