I have this line to remove all non-alphanumeric characters except spaces

re.sub(r'\W+', '', s)

However, it still keeps non-English characters.

For example, if I have

re.sub(r'\W+', '', 'This is a sentence, and here are non-english 托利 苏 !!11')

I want to get as output:

> 'This is a sentence and here are non-english  11'

2 Answers

5
Nir Levy On Best Solutions
re.sub(r'[^A-Za-z0-9 ]+', '', s)

(Edit) To clarify: The [] define a character class. A leading ^ negates the class. A-Za-z is the English alphabet, 0-9 the digits, and ' ' is the space. Any run of one or more characters outside the class (that is, anything that is not A-Z, a-z, 0-9, or a space) is replaced with the empty string.
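
As a quick check of the pattern above, here is a minimal sketch that runs it on the sentence from the question. The helper name keep_ascii_alnum_and_spaces is only an illustrative assumption, not part of the original answer. Note that the hyphen in "non-english" is removed as well, because it is not in the character class, and that the spaces around each removed run are left behind.

```python
import re

# Matches runs of characters that are NOT ASCII letters, digits, or spaces.
_NOT_ALLOWED = re.compile(r'[^A-Za-z0-9 ]+')

def keep_ascii_alnum_and_spaces(s):
    """Hypothetical helper: drop everything except A-Z, a-z, 0-9 and spaces."""
    return _NOT_ALLOWED.sub('', s)

cleaned = keep_ascii_alnum_and_spaces(
    'This is a sentence, and here are non-english 托利 苏 !!11'
)
print(repr(cleaned))
```

If the hyphen should survive, placing it last in the class, as in [^A-Za-z0-9 -]+, makes it a literal character to keep.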

0
LogicalBranch On

I once had this exact problem; the only difference was that I wasn't able to import anything or use regex.

To solve my problem I created a list containing all of the values I wanted to keep:

values = list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ")

Then I created a function that would loop through each item in the string and if it wasn't in the values list, it'd remove (replace) it from the string:

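def remover(my_string=""):
    # Strings are immutable, so the loop keeps iterating over the original
    # string even though my_string is rebound inside it.
    for item in my_string:
        if item not in values:
            # Remove every occurrence of the unwanted character.
            my_string = my_string.replace(item, "")
    return my_string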

For example, the following code:

print(remover("H!e£l$l%o^ W&o*r(l)d!:)"))

Outputs:

Hello World

Sure, this isn't the best way to do it, but given the circumstances it was a quick and easy way to get the job done.

NOTE: to do the opposite and remove the characters that are in the values list, change if item not in values to if item in values.

NOTE: I wasn't allowed to use string constants (such as string.ascii_letters) because the string module has to be imported to use them.
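
For comparison, here is a minimal sketch of another import-free variant that builds the result in a single pass instead of calling replace once per unwanted character. The names ALLOWED and keep_allowed are assumptions made for illustration; this is an alternative to the remover function above, not the same code.

```python
# Characters to keep, stored in a set for fast membership tests.
ALLOWED = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ")

def keep_allowed(my_string=""):
    """Hypothetical helper: keep only the characters found in ALLOWED."""
    # Join the kept characters back together in their original order.
    return "".join(ch for ch in my_string if ch in ALLOWED)

print(keep_allowed("H!e£l$l%o^ W&o*r(l)d!:)"))
```

Both versions need no imports; the join-based one simply avoids rescanning the whole string for each character that gets dropped.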

Good luck.