Is there a regex to grab all quotation marks?


I know that regex has \s to match all whitespace (spaces, tabs, ...), \d for any digit, etc.

Is there a similar shortcut to match all the different quotation marks: ' " “ ” ‘ ’ „ ” « »?

And more on Wikipedia ...

I can write my own regex, but I will probably miss some quotation marks from other languages, so I would like a generic way to match all quotation marks.

Or are they perhaps considered unrelated characters, so that this is impossible?


There are 4 answers

0
Stephen C On BEST ANSWER

Is there a similar shortcut to match all the different quotation marks

There is no such short-cut, in Java ... or (AFAIK) in any other dialect of regexes.

I can write my own regex, but I will probably miss some quotation marks from other languages, so I would like a generic way to match all quotation marks.

Unfortunately, there is no Unicode character class that consists of all "quotation" characters.

And there is no simple / guaranteed heuristic based on character names either.

1
marvel308 On

You can use the regex:

['"“”‘’„”«»]

see the regex101 demo
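As a minimal sketch of how such a character class might be used from Java: the class name `QuoteClassDemo` and the helper `normalizeQuotes` are made up for illustration, and the class covers only the characters listed in the question, not every quotation mark in Unicode.

```java
import java.util.regex.Pattern;

final class QuoteClassDemo {
    // The character class from above, written with Unicode escapes so the
    // source file encoding does not matter. It covers only the apostrophe,
    // the ASCII double quote, U+201C, U+201D, U+2018, U+2019, U+201E,
    // U+00AB and U+00BB.
    private static final Pattern QUOTES =
            Pattern.compile("['\"\u201C\u201D\u2018\u2019\u201E\u00AB\u00BB]");

    // Hypothetical helper: replaces every matched quotation mark
    // with a plain ASCII double quote.
    static String normalizeQuotes(String text) {
        return QUOTES.matcher(text).replaceAll("\"");
    }

    public static void main(String[] args) {
        System.out.println(
                normalizeQuotes("\u00ABBonjour\u00BB, \u201Ehallo\u201C, 'hi'"));
    }
}
```

Any quotation mark that is not in the list passes through unchanged, which is exactly the gap the question worries about.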

0
AudioBubble On

Approach:

If you cannot enumerate every quotation mark, write a regex for the characters you do need (everything other than quotation marks). Otherwise, list all the quotation marks you expect in a character class such as ['"“”‘’„”«»].
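One way to read the first suggestion is as a whitelist. The sketch below assumes that only letters, decimal digits and whitespace are wanted; that assumption, the class name `WhitelistDemo` and the helper `keepWanted` are illustrative and not part of the answer. Note that it strips all other punctuation too, not just quotation marks.

```java
import java.util.regex.Pattern;

final class WhitelistDemo {
    // Matches any character that is NOT a Unicode letter, a decimal digit,
    // or whitespace. Assumption: those are the only characters to keep.
    private static final Pattern NOT_WANTED =
            Pattern.compile("[^\\p{L}\\p{Nd}\\s]");

    // Hypothetical helper: removes everything outside the whitelist,
    // which includes every quotation mark, known or unknown.
    static String keepWanted(String text) {
        return NOT_WANTED.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(keepWanted("\u201CHello\u201D \u00ABworld\u00BB 42!"));
    }
}
```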

2
Joop Eggen On

Java's Unicode support is very detailed and even classifies punctuation, but it has no single category for quotation marks. Some quotation marks are typed as neither initial nor final quote punctuation. You can, however, collect them by name and generate code from the result. Advantage: completeness.

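    public class QuotationMarks {
        public static void main(String[] args) {
            // Scan U+0020..U+FFFF for characters whose Unicode name contains "QUOTATION".
            for (int cp = 32; cp <= 0xFFFF; ++cp) {
                String name = Character.getName(cp);
                if (name != null && name.contains("QUOTATION")) {
                    // Prints: code point = name (is initial quote, is final quote)
                    System.out.printf("\\u%04x = %s (%s %s)%n",
                            cp, name,
                            Character.getType(cp) == Character.INITIAL_QUOTE_PUNCTUATION,
                            Character.getType(cp) == Character.FINAL_QUOTE_PUNCTUATION);
                }
            }
        }
    }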

The loop stops at U+FFFF, where every code point still fits in a single char, so it does not cover characters in the supplementary planes. This results in:

\u0022 = QUOTATION MARK (false false)
\u00ab = LEFT-POINTING DOUBLE ANGLE QUOTATION MARK (true false)
\u00bb = RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK (false true)
\u2018 = LEFT SINGLE QUOTATION MARK (true false)
\u2019 = RIGHT SINGLE QUOTATION MARK (false true)
\u201a = SINGLE LOW-9 QUOTATION MARK (false false)
\u201b = SINGLE HIGH-REVERSED-9 QUOTATION MARK (true false)
\u201c = LEFT DOUBLE QUOTATION MARK (true false)
\u201d = RIGHT DOUBLE QUOTATION MARK (false true)
\u201e = DOUBLE LOW-9 QUOTATION MARK (false false)
\u201f = DOUBLE HIGH-REVERSED-9 QUOTATION MARK (true false)
\u2039 = SINGLE LEFT-POINTING ANGLE QUOTATION MARK (true false)
\u203a = SINGLE RIGHT-POINTING ANGLE QUOTATION MARK (false true)
\u275b = HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT (false false)
\u275c = HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT (false false)
\u275d = HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT (false false)
\u275e = HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT (false false)
\u275f = HEAVY LOW SINGLE COMMA QUOTATION MARK ORNAMENT (false false)
\u2760 = HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT (false false)
\u276e = HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT (false false)
\u276f = HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT (false false)
\u301d = REVERSED DOUBLE PRIME QUOTATION MARK (false false)
\u301e = DOUBLE PRIME QUOTATION MARK (false false)
\u301f = LOW DOUBLE PRIME QUOTATION MARK (false false)
\uff02 = FULLWIDTH QUOTATION MARK (false false)
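The same scan can generate a regex instead of a listing. The following is a sketch only: the class name `QuotePatternBuilder` and the method `buildQuotePattern` are hypothetical, and the apostrophe is added by hand because its Unicode name (APOSTROPHE) does not contain "QUOTATION", so the name test misses it. The exact set depends on the Unicode version of the running JDK.

```java
import java.util.regex.Pattern;

final class QuotePatternBuilder {
    // Builds a character class from every code point in U+0020..U+FFFF whose
    // Unicode name contains "QUOTATION". The class starts with the apostrophe,
    // which the name test would otherwise miss.
    static Pattern buildQuotePattern() {
        StringBuilder cls = new StringBuilder("['");
        for (int cp = 32; cp <= 0xFFFF; ++cp) {
            String name = Character.getName(cp);
            if (name != null && name.contains("QUOTATION")) {
                // Append a regex-level escape such as \u201C.
                cls.append(String.format("\\u%04X", cp));
            }
        }
        return Pattern.compile(cls.append(']').toString());
    }

    public static void main(String[] args) {
        Pattern quotes = buildQuotePattern();
        // Replace every collected quotation mark with an ASCII double quote.
        System.out.println(quotes.matcher("\u201Ehallo\u201C").replaceAll("\""));
    }
}
```

Building the pattern once and reusing it avoids repeating the scan of the whole range on every match.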