Regex to match binary literal number in re2c format


I'm trying to create a regex, in RE2C's regex format (1), for matching binary literal numbers. They should look like:

  • 0b1, 0b101, 0b1111, 0b11_11, 0b1_111, 0b1_1_1_1, etc

The underscore is a convenience separator and is ignored when extracting the resulting digits. However, the separator should only appear between digits (not at the beginning or the end), and there should never be two or more consecutive underscores.
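To illustrate what "ignored when extracting the digits" could mean in practice, here is a minimal sketch in plain C. The helper name `binary_literal_value` is made up for this example, and it assumes the literal has already been validated and fits in 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: numeric value of an already validated literal
 * such as "0b1_111". Underscores are skipped; overflow past 64 bits
 * is not handled in this sketch. */
uint64_t binary_literal_value(const char *s) {
    assert(s != NULL && s[0] == '0' && s[1] == 'b');
    uint64_t value = 0;
    for (const char *p = s + 2; *p != '\0'; p++) {
        if (*p == '_') continue;                    /* separator carries no value */
        value = (value << 1) | (uint64_t)(*p - '0'); /* append one binary digit */
    }
    return value;
}
```

With this sketch, `0b1111`, `0b11_11` and `0b1_1_1_1` all yield 15.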

This is the regex that makes sense to me:

BINARY_NUM = "0b" ("0"|"1") ("_"? ("0"|"1"))*;

I'm trying to say:

  • it starts with "0b"
  • it is followed by one digit of 0 or 1.
  • it is followed by any number of these combinations: "0", "1", "_0", "_1"

However, input with a trailing "_" is still accepted by the regex above. So, these two are treated the same:

  • 0b1_0
  • 0b1_0_
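One way to see what is happening: the pattern itself can never end in an underscore, but a lexer takes the longest prefix that matches, so for `0b1_0_` it stops after `0b1_0` and leaves the `_` for the next rule. The following plain C sketch mimics that prefix behaviour by hand. The name `binary_prefix_len` is invented for illustration, and this is not the code re2c generates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of the longest prefix of s that matches
 *   "0b" ("0"|"1") ("_"? ("0"|"1"))*
 * or 0 if no prefix matches. */
size_t binary_prefix_len(const char *s) {
    assert(s != NULL);
    if (s[0] != '0' || s[1] != 'b') return 0;
    if (s[2] != '0' && s[2] != '1') return 0;
    size_t i = 3;                               /* "0b" plus the first digit */
    for (;;) {
        size_t j = i;
        if (s[j] == '_') j++;                   /* optional single separator */
        if (s[j] != '0' && s[j] != '1') break;  /* a digit must follow */
        i = j + 1;                              /* commit only after the digit */
    }
    return i;
}
```

Both `0b1_0` and `0b1_0_` give a prefix length of 5 here, which is why the two inputs look equivalent at the point where the rule fires.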

How can I prevent the matching of trailing "_"?


There are 2 answers

jhnc On

Based purely on this documentation, a form of lookahead is supported. So one might hope that this would work:

BINARY_NUM = "0b" ("0"|"1") ("_"? ("0"|"1"))* / [^_01] ;

or slightly more compactly:

BINARY_NUM = "0b" [01]+ ( "_" [01]+ )* / [^_01] ;

although something more sophisticated would be needed, since the example above implies that:

0b010101010222

would be parsed as 0b010101010 followed by 222.
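To make the effect of the `/ [^_01]` trailing context concrete, here is a rough hand-written sketch in plain C. The names `is_bit` and `number_with_lookahead` are hypothetical, and the sketch assumes that the terminating NUL counts as an acceptable following character.

```c
#include <assert.h>
#include <stddef.h>

static int is_bit(char c) { return c == '0' || c == '1'; }

/* Hypothetical helper mimicking
 *   "0b" [01]+ ( "_" [01]+ )* / [^_01]
 * Returns the length of the matched number (the lookahead character is
 * not counted), or 0 when the rule does not match. */
size_t number_with_lookahead(const char *s) {
    assert(s != NULL);
    if (s[0] != '0' || s[1] != 'b' || !is_bit(s[2])) return 0;
    size_t i = 2;
    while (is_bit(s[i])) i++;                  /* [01]+ */
    while (s[i] == '_' && is_bit(s[i + 1])) {  /* ( "_" [01]+ )* */
        i++;                                   /* consume the separator */
        while (is_bit(s[i])) i++;              /* consume the digit run */
    }
    /* Trailing context: the next character must not be '_', '0' or '1'.
     * Assumption: the terminating NUL is accepted as a follower. */
    if (s[i] == '_' || is_bit(s[i])) return 0;
    return i;
}
```

For `0b010101010222` this returns 11, so `222` is left over as separate input, while `0b1_0_` is rejected outright because the character after `0b1_0` is an underscore.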

However, I discovered "trailing contexts are not allowed in named definitions" when I tried substituting the above into the introductory sample code in the manual.

Modifying it with the "sentinel" example, I get:

// re2c $INPUT -o $OUTPUT -i --case-ranges
#include <assert.h>
#include <stdbool.h>

bool lex(const char *s) {
    const char *YYCURSOR = s;
    const char *YYMARKER;
    for(;;) {
    /*!re2c
        re2c:yyfill:enable = 0;
        re2c:define:YYCTYPE = char;

        number = "0b" [01]+ ( "_" [01]+ )*;

        number { continue; }
        [\x00] { return true; }
        *      { return false; }
    */
    }
}

int main() {
    assert(lex("0b01_001"));
    assert(!lex("0b00000_")); /* a trailing '_' must be rejected */
    return 0;
}

This successfully rejects trailing _.

It is also possible to just include the null sentinel directly:

number = "0b" [01]+ ("_"[01]+)* "\x00";

// re2c $INPUT -o $OUTPUT -i --case-ranges
#include <assert.h>
#include <stdbool.h>

bool lex(const char *s) {
    const char *YYCURSOR = s;
    const char *YYMARKER;
    /*!re2c
        re2c:yyfill:enable = 0;
        re2c:define:YYCTYPE = char;

        number = "0b" [01]+ ( "_" [01]+ )* "\x00";

        number { return true; }
        *      { return false; }
    */
}

int main() {
    assert(lex("0b01_001"));
    assert(!lex("0b00000_")); /* a trailing '_' must be rejected */
    return 0;
}
Bohemian On

In regular regex style:

^0b([01]_?)*[01]$

The regex:

  • ^0b literal "0b" at the start of the input
  • ([01]_?)* zero or more repetitions of a binary digit optionally followed by an underscore
  • [01]$ a final binary digit at the end of the input


Expressed as RE2C (I think):

BINARY_NUM = "0b" ([01]"_"?)* [01];
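For checking the anchored form outside of re2c, a small sketch using POSIX `regcomp`/`regexec` might look like this. It assumes a POSIX system with `<regex.h>`, and the helper name `is_binary_literal` is made up for the example.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if s as a whole matches ^0b([01]_?)*[01]$ */
bool is_binary_literal(const char *s) {
    regex_t re;
    /* REG_EXTENDED selects ERE syntax; REG_NOSUB skips capture reporting. */
    int rc = regcomp(&re, "^0b([01]_?)*[01]$", REG_EXTENDED | REG_NOSUB);
    assert(rc == 0);                 /* the pattern is a fixed, valid ERE */
    bool ok = regexec(&re, s, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

Here it is the `^` and `$` anchors that reject `0b1_0_`. The RE2C form above has no anchors, so on its own it would presumably still match the `0b1_0` prefix of `0b1_0_` and leave the underscore for the next rule.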