Regular expression to remove ASCII control characters in Java

I've been reading that the below pattern used as part of String#replaceAll() in Java

"[\\p{Cntrl}&&[^\r\n\t]]"

removes various non-printable ASCII characters from the string.

How does one interpret the above incantation:

  • which characters are included as part of those control chars to be removed?
  • what does the && stand for?
  • does ^ mean it only looks at the beginning of the line?

Can someone please provide a comprehensive non-technical explanation of the above expression?

Thank you in advance.

There are 3 answers

Reilas On BEST ANSWER

"... which characters are included as part of those control chars to be removed? ..."

You can find this information in the Pattern class JavaDoc.

Pattern – POSIX character classes (US-ASCII only) – (Java SE 20 & JDK 20).

\p{Cntrl}    A control character: [\x00-\x1F\x7F]

That is, code points 0x00 through 0x1F, plus 0x7F (DEL).
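
To check the documented range rather than take it on trust, here is a minimal sketch that asks the regex engine which code points \p{Cntrl} accepts. The class and method names are made up for illustration, and it assumes the default mode (no UNICODE_CHARACTER_CLASS flag), where the POSIX classes are US-ASCII only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class CntrlMembers {
    // Collects every code point in 0x00..0xFF that \p{Cntrl} matches
    // when the pattern is compiled without any flags.
    static List<Integer> controlCodePoints() {
        Pattern cntrl = Pattern.compile("\\p{Cntrl}");
        List<Integer> hits = new ArrayList<>();
        for (int cp = 0; cp <= 0xFF; cp++) {
            String s = String.valueOf((char) cp);
            if (cntrl.matcher(s).matches()) {
                hits.add(cp);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Integer> hits = controlCodePoints();
        // 32 code points from 0x00..0x1F plus 0x7F gives 33 in total.
        System.out.println(hits.size() + " code points, last = 0x"
                + Integer.toHexString(hits.get(hits.size() - 1)));
    }
}
```

Scanning up to 0xFF also shows that 0x80 through 0x9F are not matched in this mode.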

"... what does the && stand for? ..."

The && is part of the syntax for a character class intersection.

For example, the following matches any character from a through z, except x and y.

[a-z&&[^xy]]
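
A small sketch of that intersection in use, in case it helps to see it run. The helper name maskExceptXY is hypothetical; it replaces every character the class matches with an asterisk so the untouched characters stand out.

```java
final class IntersectionDemo {
    // Replaces each lowercase letter other than x and y with '*'.
    // [a-z&&[^xy]] reads as: in a-z AND not in {x, y}.
    static String maskExceptXY(String input) {
        return input.replaceAll("[a-z&&[^xy]]", "*");
    }

    public static void main(String[] args) {
        System.out.println(maskExceptXY("wxyz"));   // prints "*xy*"
        System.out.println(maskExceptXY("Axe-y1")); // prints "Ax*-y1"
    }
}
```

Characters outside a-z (uppercase letters, digits, punctuation) are never matched, because they fail the first half of the intersection.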

"... does ^ mean it only looks at the beginning of the line? ..."

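Not within a character class, [ ]. There, a leading ^ negates the class, so [^\r\n\t] matches any character other than carriage return, line feed, or tab.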
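
The two meanings of ^ can be put side by side in a short sketch. The method names are invented for the example: one uses ^ outside a class as a start-of-input anchor, the other uses it as the first character inside a class to negate it.

```java
import java.util.regex.Pattern;

final class CaretDemo {
    // Outside a class, ^ anchors the match at the start of the input.
    static boolean startsWithA(String s) {
        return Pattern.compile("^a").matcher(s).find();
    }

    // Inside a class, a leading ^ negates it: [^a] is "any character but a".
    static String dropEverythingButA(String s) {
        return s.replaceAll("[^a]", "");
    }

    public static void main(String[] args) {
        System.out.println(startsWithA("abc"));            // prints "true"
        System.out.println(startsWithA("bca"));            // prints "false"
        System.out.println(dropEverythingButA("banana"));  // prints "aaa"
    }
}
```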

DuncG On

The pattern matches a character that is in the set of control characters \p{Cntrl}, intersected (by &&) with the set of characters that are not carriage return, line feed, or tab, [^\r\n\t]. Example:

"a\u0001b\u0002c\rd\te\nf".replaceAll("[\\p{Cntrl}&&[^\r\n\t]]","-");
=> control codes 0001 and 0002 are replaced with "-": "a-b-c\rd\te\nf"

To help explain, consider swapping \\p{Cntrl} for [a-z] and [^\r\n\t] for [^aeiou]; you then have a pattern that matches lowercase consonants:

"123abcdef".replaceAll("[[a-z]&&[^aeiou]]","-");
=> "123a---e-"
phatfingers On

There are a few things going on here that are only available in some flavors of regex. You may encounter differences in implementation or availability in different languages.

Where supported, you can define a character class with multiple classes within it. For example, [[a-z][0-9]] is a valid equivalent to [a-z0-9].

Where the && operator is supported, it can be used to create a character class that is the intersection of two character classes. For example, [[a-z]&&[^d-w]] would be equivalent to [abcxyz].
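
Both equivalences can be checked by brute force in Java, which supports nested classes and &&. The sketch below uses a hypothetical helper, sameOverAscii, that compares two single-character classes on every ASCII character; it only tests 0x00 through 0x7F, so it says nothing about characters outside that range.

```java
import java.util.regex.Pattern;

final class ClassEquivalence {
    // True when both character classes agree on every char in 0x00..0x7F.
    static boolean sameOverAscii(String classA, String classB) {
        Pattern a = Pattern.compile(classA);
        Pattern b = Pattern.compile(classB);
        for (char c = 0; c <= 0x7F; c++) {
            String s = String.valueOf(c);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Union of nested classes.
        System.out.println(sameOverAscii("[[a-z][0-9]]", "[a-z0-9]"));
        // Intersection with a negated class.
        System.out.println(sameOverAscii("[[a-z]&&[^d-w]]", "[abcxyz]"));
    }
}
```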

There are a number of predefined character classes that can be referenced with \p{name}. In Java, the POSIX class \p{Cntrl} represents [\x00-\x1F\x7F] by default. You can find the full list in the Java docs for java.util.regex.Pattern.
(See: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html)

So, your regex matches every character in the range [\x00-\x1F\x7F] except for characters [\r\n\t].
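
Putting it together, one way the pattern might be wrapped for reuse is sketched below. The class and method names are hypothetical, and the choice to replace with an empty string (rather than a marker character) is an assumption about what "remove" should mean here.

```java
import java.util.regex.Pattern;

final class ControlCharStripper {
    // Compiled once: ASCII control characters, minus CR, LF, and TAB.
    private static final Pattern UNWANTED =
            Pattern.compile("[\\p{Cntrl}&&[^\r\n\t]]");

    // Deletes every matching character and leaves the rest of the text intact.
    static String strip(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String cleaned = strip("a\u0001b\u0002c\rd\te\nf");
        // \u0001 and \u0002 are gone; \r, \t, and \n survive.
        System.out.println(cleaned.equals("abc\rd\te\nf")); // prints "true"
    }
}
```

String#replaceAll compiles the pattern on every call, so holding a compiled Pattern in a constant is a reasonable choice when the same cleanup runs many times.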