Python replace unprintable characters except linebreak


I am trying to write a function that replaces unprintable characters with a space. It works, but it also replaces the line break \n with a space, and I cannot figure out why.

Test code:

import re

def replace_unknown_characters_with_space(input_string):
    # Replace non-printable characters (including escape sequences) with spaces
    # According to ChatGPT, \n should not be in this range
    cleaned_string = re.sub(r'[^\x20-\x7E]', ' ', input_string)

    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\x0Ais\x2028a\x2029test."
    
    print("Original String:")
    print(test_string)
    
    cleaned_string = replace_unknown_characters_with_space(test_string)
    
    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()

Output:

Original String:
This is a test string with some unprintable characters:
Hello
Thisd
is 28a 29test.

Cleaned String:
This is a test string with some unprintable characters: Hello World This is 28a 29test.

As you can see, the linebreak before Hello World is replaced by space, which is not intended. I tried to get help from ChatGPT but its regex solutions don't work.

My last resort is to loop over the string and filter characters with Python's built-in str.isprintable() method, but that would be much slower than a regex.
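For reference, the fallback described above could be sketched roughly as follows. The function name is hypothetical. Note that str.isprintable() returns False for '\n', so the newline has to be kept explicitly, and that it treats non-ASCII letters as printable, unlike the ASCII-only regex.

```python
def replace_with_isprintable(text):
    # str.isprintable() is False for '\n', so the newline is kept explicitly.
    return "".join(ch if ch == "\n" or ch.isprintable() else " " for ch in text)

print(repr(replace_with_isprintable("a\nb\x85c\rd")))  # 'a\nb c d'
```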

4

There are 4 answers

0
Electron X On BEST ANSWER

Modified regular expression, inspired by Carlo Arenas' answer.

Code:

import re

def replace_unknown_characters_with_space(input_string):
    # Replace everything except printable ASCII and \n with a space.
    # Adding \n to the negated class is what preserves it; re.DOTALL only
    # affects '.', so the flag is not needed here.
    cleaned_string = re.sub(r'[^\x20-\x7E\n]', ' ', input_string)

    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\nis\x2028a\x2029test."
    
    print("Original String:")
    print(test_string)
    
    cleaned_string = replace_unknown_characters_with_space(test_string)
    
    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()

Output

Original String:
This is a test string with some unprintable characters:
Hello
Thisd
is 28a 29test.

Cleaned String:
This is a test string with some unprintable characters:
Hello World This
is 28a 29test.

\n is no longer replaced

2
Carlo Arenas On

The question seems to have multiple parts, so let's address them independently.

Why is '\n' affected?

'\n' is not special inside a character class. It is simply the character with code 0x0A, which falls outside the range \x20-\x7E, so the negated class [^\x20-\x7E] matches it and it gets replaced like any other control character.

As you found, adding '\n' to the negated class excludes it from the match, and that alone is enough. The re.DOTALL flag only changes what '.' matches, so it has no effect on this pattern.
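To check how the character class treats '\n', here is a small sketch using only the standard re module. The variable names are made up for illustration.

```python
import re

# '\n' is code point 0x0A, which lies outside the range \x20-\x7E,
# so the negated class [^\x20-\x7E] matches it.
original = re.compile(r'[^\x20-\x7E]')
keeps_newline = re.compile(r'[^\x20-\x7E\n]')

sample = "a\nb\x85c"

without_flag = keeps_newline.sub(' ', sample)
with_flag = re.sub(r'[^\x20-\x7E\n]', ' ', sample, flags=re.DOTALL)

print(repr(original.sub(' ', sample)))  # 'a b c'   (newline replaced)
print(repr(without_flag))               # 'a\nb c'  (newline kept)
print(without_flag == with_flag)        # True: re.DOTALL makes no difference here
```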

What do you mean by "printable"?

"Printable" has a wider meaning than what ChatGPT suggested, which seems to be a translation of the POSIX ASCII class [:print:] to Python (I would have recommended [:graph:] instead). Judging from your test, you might have been more interested in removing "funny characters" that would affect the output when printed.

Your test includes characters, probably mistranslated by ChatGPT, that were meant to be the Unicode line and paragraph separators (\s is a better option for matching whitespace like those in a regex). Python's \x escape takes exactly two hex digits, so code points above 255 need \u with four hex digits instead; the \x2028 form was probably taken from PCRE syntax.
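The difference between the two escapes can be seen in a short sketch (the variable names are illustrative only):

```python
# '\x' takes exactly two hex digits, so "\x2028" is a space (0x20) followed by "28".
two_digit_escape = "\x2028"
# '\u' takes four hex digits and yields the single code point U+2028 (LINE SEPARATOR).
line_separator = "\u2028"

print(len(two_digit_escape), repr(two_digit_escape))  # 3 ' 28'
print(len(line_separator), hex(ord(line_separator)))  # 1 0x2028
```

This is why the question's output shows "is 28a 29test.": the test string never contained the separators at all.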

Since you include Unicode separators, and Python strings are sequences of Unicode code points, it might seem logical to also filter out "funny Unicode characters" like the BIDI controls, which would have an effect similar to '\r' if you plan to print the string later.

If you consider any non-ASCII character "funny", then the solution will need to change as well.
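One possible way to express "funny characters" without listing them by hand is to filter on Unicode general categories. This is only a sketch under the assumption that controls (Cc), format characters (Cf, which includes the BIDI controls) and line/paragraph separators (Zl, Zp) are the unwanted ones; the function name and the choice of categories are hypothetical, not part of the original question.

```python
import unicodedata

def replace_control_and_format(text, keep="\n"):
    # Replace characters in categories Cc, Cf, Zl and Zp with a space,
    # except for the characters listed in `keep`.
    unwanted = {"Cc", "Cf", "Zl", "Zp"}
    return "".join(
        ch if ch in keep or unicodedata.category(ch) not in unwanted else " "
        for ch in text
    )

sample = "Hello\x85World\rgo \u2067backwards\u2069\u2028a\u00f1o\n"
print(repr(replace_control_and_format(sample)))
```

Letters such as ñ survive because they are not in any of the listed categories, while '\n' survives only because it is named in keep.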

The following version of your example (with the corrected test text and some expansion) could be considered "correct" but I suspect would require further changes as you refine your requirements.

import re

def replace_unknown_characters_with_space(input_string):
    # Replace everything except word characters (Unicode-aware \w), '\n' and
    # ASCII punctuation with a space. re.DOTALL is not needed: '.' is not used.
    cleaned_string = re.sub(r'[^][\w\n!"\#$%&\'()*+,./:;<=>?@\\\^_`{|}~-]', ' ', input_string)

    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0AThis\x0Dis\u2028a\u2029test. including some punctuation like `({~})' and even \\, and \" + words like <año> or numbers like \u1bb1\nText cant be \033[1m[bold]\033[0m or go \u2067backwards\u2069, but can also contain wide numbers like \uff11 or 0"

    print("Original String:")
    print(test_string)

    cleaned_string = replace_unknown_characters_with_space(test_string)

    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()
3
Diego Torres Milano On

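Do the opposite: match the control characters directly and skip \x0A.

import re

def replace_unknown_characters_with_space(input_string):
    # Replace ASCII control characters with spaces, leaving \n (\x0A) untouched.
    # The class is not negated: it lists the characters to replace,
    # \x00-\x09 and \x0B-\x1F, which skips only \x0A.
    # Characters such as \x7F or \x85 are not covered by this range.
    cleaned_string = re.sub(r'[\x00-\x09\x0B-\x1F]', ' ', input_string)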

    return cleaned_string
2
SIGHUP On

You don't need re for this. You can use built-in functionality.

Just build a translation table and pass it to str.translate().

For example:

TEST_STRING = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\x0Ais\x2028a\x2029test."
TDICT = {c: " " for c in range(32) if c != 10}
print(TEST_STRING.translate(TDICT))

Output:

This is a test string with some unprintable characters:
Hello
World This
is 28a 29test.
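The table above only covers code points below 32, which is why \x85 still shows up as a line break in the output. A possible extension, assuming DEL (0x7F) and the C1 controls (0x80-0x9F) should also become spaces, might look like this. The names CONTROL_CODES and TDICT_EXTENDED are made up for the example.

```python
# Hypothetical extension of the table: also map DEL (0x7F) and the C1
# controls (0x80-0x9F) to a space, still leaving '\n' (10) untouched.
CONTROL_CODES = [*range(0x00, 0x20), *range(0x7F, 0xA0)]
TDICT_EXTENDED = {c: " " for c in CONTROL_CODES if c != 0x0A}

sample = "Hello\x85World\x0DThis\x0Ais"
print(repr(sample.translate(TDICT_EXTENDED)))  # 'Hello World This\nis'
```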

Note:

re is much faster than str.translate once you've discerned the correct regular expression