How do I write a PEG for LPeg that parses Lua itself?


As the title says: I know Lua has an official extended BNF in "The Complete Syntax of Lua". I want to write a PEG that I can pass to lpeg.re.compile to parse Lua itself. The Lua PEG would presumably look much like its BNF. I have read the BNF and tried to translate it into a PEG, but I found Numeral and LiteralString hard to write. Has anyone done something like this?

local lpeg = require "lpeg"
local re = lpeg.re

local p = re.compile([[
    chunk <- block
    block <- stat * retstat ?
    stat <- ';' /
            varlist '=' explist /
            functioncall /
            label /
            'break' /
            'goto' Name /
            'do' block 'end' /
            'while' exp 'do' block 'end' /
            'repeat' block 'until' exp /
            'if' exp 'then' block ('elseif' exp 'then' block) * ('else' block) ? 'end' /
            'for' Name '=' exp ',' exp (',' exp) ? 'do' block 'end' /
            'for' namelist 'in' explist 'do' block 'end' /
            'function' funcname funcbody /
            'local function' Name funcbody /
            'local' attnamelist ('=' explist) ?
    attnamelist <- Name attrib (',' Name attrib) *
    attrib <- ('<' Name '>') ?
    retstat <- 'return' explist ? ';' ?
    label <- '::' Name '::'
    funcname <- Name ('.' Name) * (':' Name) ?
    varlist <- var (',' var) *
    var <- Name / prefixexp '[' exp ']' / prefixexp '.' Name
    namelist <- Name (',' Name) *
    explist <- exp (',' exp) *
    exp <- 'nil' / 'false' / 'true' / Numeral / LiteralString / "..." / functiondef /
           prefixexp / tableconstructor / exp binop exp / unop exp
    prefixexp <- var / functioncall / '(' exp ')'
    functioncall <- prefixexp args / prefixexp ":" Name args
    args <- '(' explist ? ')' / tableconstructor / LiteralString
    functiondef <- 'function' funcbody
    funcbody <- '(' parlist ? ')' block 'end'
    parlist <- namelist (',' '...') ? / '...'
    tableconstructor <- '{' fieldlist ? '}'
    fieldlist <- field (fieldsep field) * fieldsep ?
    field <- '[' exp ']' '=' exp / Name '=' exp / exp
    fieldsep <- ',' / ';'
    binop <- '+' / '-' / '*' / '/' / '//' / '^' / '%' /
             '&' / '~' / '|' / '>>' / '<<' / '..' /
             '<' / '<=' / '>' / '>=' / '==' / '~=' /
             'and' / 'or'
    unop <- '-' / 'not' / '#' / '~'

    saveword <- "and" / "break" / "do" / "else" / "elseif" / "end" /
                "false" / "for" / "function" / "goto" / "if" / "in" /
                "local" / "nil" / "not" / "or" / "repeat" / "return" /
                "then" / "true" / "until" / "while"
    Name <- ! saveword / name
    Numeral <- 
    LiteralString <- 
]])

There is 1 answer

Answered by Luatic

First off: you need to parse Lua in a two-step process consisting of tokenization (lexical analysis, regular expressions) and parsing (syntactic analysis, context-free grammars). Consider the syntactically invalid Lua code if1then print()end. If you parse this in one go, you might not get a syntax error, since it could reasonably be read as if, the number 1, then, and so on. Tokenization, however, greedily turns if1then into a single "identifier"/name token, which triggers a syntax error in the syntactic analysis later on.

PEGs may let you express this in some cases through their ordered choice, but in general the two-step process should be used so that you do not end up with an overly permissive (and possibly ambiguous) grammar.
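The greedy behaviour of the tokenization step can be reproduced with a toy LPeg tokenizer. This is only a minimal sketch: the pattern names (name, number, symbol, tokens) are made up for illustration, and the token set is far smaller than real Lua's.

```lua
local lpeg = require "lpeg"
local P, R, S, C, Ct = lpeg.P, lpeg.R, lpeg.S, lpeg.C, lpeg.Ct

-- Hypothetical toy token rules, not a complete Lua lexer
local name_char = R("AZ", "az", "09") + P"_"
local name = C((R("AZ", "az") + P"_") * name_char ^ 0)
local number = C(R"09" ^ 1)
local symbol = C(S"()")
local space = S" \t\n" ^ 0

-- A token list: tokens separated by optional whitespace, up to end of input
local tokens = Ct(space * ((name + number + symbol) * space) ^ 0) * -P(1)

local result = tokens:match("if1then print()end")
for index, token in ipairs(result) do
    print(index, token)
end
```

The first token comes out as the single name if1then rather than the keyword if, so a parser working on this token list has no if statement to recognize and has to report a syntax error.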

The "rules" still left to be written are all token rules (as the capitalized names indicate): Name, LiteralString and Numeral. These are basically just simple regular expressions. As for Names: if you use ordered choice (+ in LPeg, / in re) cleverly, you don't need "subtraction" (negative lookahead) to keep keywords from being parsed as Names. Just do something along the lines of Token = Keyword + Name + ... in your tokenization grammar.
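A minimal sketch of the Keyword + Name idea follows. One caveat that goes beyond the answer's wording: a bare Keyword + Name would split a name such as iffy into the keyword if followed by fy, so this sketch assumes a word-boundary check after each keyword. Only three keywords are listed for brevity, and all identifiers are illustrative.

```lua
local lpeg = require "lpeg"
local P, R, C, Cc = lpeg.P, lpeg.R, lpeg.C, lpeg.Cc

local name_start = R("AZ", "az") + P"_"
local name_char = name_start + R"09"

-- Longer keywords are listed before their prefixes (elseif before else);
-- the trailing -name_char is the assumed word-boundary check.
local keyword = (P"elseif" + P"else" + P"if") * -name_char
local name = name_start * name_char ^ 0

-- Ordered choice: try Keyword first and fall back to Name
local token = Cc("keyword") * C(keyword) + Cc("name") * C(name)

print(token:match("if"))
print(token:match("iffy"))
print(token:match("elseif"))
```

Each match returns two values: the kind of token and the matched text.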

Literal strings are indeed tricky because of long strings, which can't be matched by regular expressions; quoted strings are rather easy (though you have to deal with escapes). The LPeg docs have an example for long strings:

equals = lpeg.P"="^0
open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
close = "]" * lpeg.C(equals) * "]"
closeeq = lpeg.Cmt(close * lpeg.Cb("init"), function (s, i, a, b) return a == b end)
string = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1
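To try the documentation example, it can be wrapped in locals as below. The only changes are the local declarations and the name long_string, chosen here so that the pattern does not overwrite Lua's global string library; both are adjustments for this sketch, not part of the LPeg docs.

```lua
local lpeg = require "lpeg"

local equals = lpeg.P"=" ^ 0
local open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n" ^ -1
local close = "]" * lpeg.C(equals) * "]"
-- Succeeds only if the closing bracket has as many "=" as the opening one
local closeeq = lpeg.Cmt(close * lpeg.Cb("init"), function(s, i, a, b) return a == b end)
-- Renamed from "string" (an assumption of this sketch) to keep the string library intact
local long_string = open * lpeg.C((lpeg.P(1) - closeeq) ^ 0) * close / 1

-- The inner "]]" does not end the string because its level (0) differs from 2
local body = long_string:match("[==[\nx = ']]' ]==]")
print(body)

-- A closing bracket of the wrong level never terminates the string
print(long_string:match("[=[ abc ]]"))
```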

Numerals are a bit clunky because you have to deal with many different cases for different bases, omission of the dot, omission of 0 before or after the dot, exponents, signs etc.
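One pitfall when enumerating these cases in LPeg is the order of the alternatives: ordered choice commits to the first alternative that matches, so a short alternative listed first hides the longer ones. The sketch below illustrates this for decimal numerals only; the names short_first and long_first are made up for the comparison.

```lua
local lpeg = require "lpeg"
local P, R, C = lpeg.P, lpeg.R, lpeg.C

local uint = R"09" ^ 1

-- Shortest alternative first: matching stops after the integer part
local short_first = C(uint + uint * P"." * uint)

-- Longest alternatives first: the whole numeral is consumed
local long_first = C(uint * P"." * uint + uint * P"." + P"." * uint + uint)

print(short_first:match("1.5"))
print(long_first:match("1.5"))
print(long_first:match(".5"), long_first:match("3."), long_first:match("42"))
```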

I happen to have the relevant LPeg rules for these lying around:

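-- Shorthands for the LPeg constructors used below
local lpeg = require "lpeg"
local P, R, S, C, Cg, Cb, Cmt = lpeg.P, lpeg.R, lpeg.S, lpeg.C, lpeg.Cg, lpeg.Cb, lpeg.Cmt
-- O is used below without a definition; assumed here to mean "optional"
local function O(patt) return patt ^ -1 end
-- Character classes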
_letter = R("AZ", "az")
_letter_ = _letter + P"_"
_digit = R"09"
_hexdigit = _digit + R("AF", "af")
white = C(S" \f\t\v\n\r" ^ 1)
_keyword = P"not"
    + P"and"
    + P"or"
    + P"function"
    + P"nil"
    + P"false"
    + P"true"
    + P"return"
    + P"goto"
    + P"do"
    + P"end"
    + P"while"
    + P"repeat"
    + P"until"
    + P"if"
    + P"then"
    + P"elseif"
    + P"else"
    + P"for"
    + P"local"
    + P"break"
    + P"in"
-- Names
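-- Reject only whole keywords, so that names such as "format" or "done" still match
Name = C(_letter_ * (_letter_ + _digit) ^ 0) - (_keyword * -(_letter_ + _digit))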
-- Numbers
local function _numeral(digit_, exponent_letter)
    local uint = digit_ ^ 1
    -- Longest alternatives first: ordered choice commits to the first one that matches
    local float = uint * P"." * uint + uint * P"." + P"." * uint + uint
    -- Exponent digits are decimal even for hexadecimal numerals
    local exponent = exponent_letter * O(S"+-") * _digit ^ 1
    return C(float) * C(O(exponent))
end
_hex_numeral = C(P"0" * S"xX") * _numeral(_hexdigit, S"pP")
_decimal_numeral = _numeral(_digit, S"eE")
Numeral = _hex_numeral + _decimal_numeral
-- Strings
decimal_escape = C(_digit * O(_digit * O(_digit)))
hex_escape = P"x" * C(_hexdigit * _hexdigit)
unicode_escape = P"u{" * C(_hexdigit^1) * P"}"
char_escape = C(S[[abfnrtv\'"]])
_escape = P[[\]] * (decimal_escape + hex_escape + char_escape + unicode_escape)
local function string_quoted(quotes)
    local range = P(1) - S(quotes .. "\0\n\r\\")
    local content = (_escape + C(range^1)) ^ 0
    return Cg(P(quotes), "quotes") * content * P(quotes)
end
local equals = P"=" ^ 0
local open = P"[" * Cg(equals, "equals") * P"[" * O(P"\n")
local close = Cmt(P"]" * C(equals) * P"]" * Cb"equals", function(_, _, open, close)
    return open == close
end)
_long_string = open * C((P(1) - close) ^ 0) * close
String = string_quoted[[']] + string_quoted[["]] + _long_string
-- -P(1) matches only at the end of input (the original referenced an undefined _eof)
_line_comment = -open * ((P(1) - P"\n") ^ 0) * (P"\n" + -P(1))
Comment = P"--" * (_long_string + _line_comment)

You might want to localize some (if not all) of these variables unless you load the rules in a custom environment.
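Loading the rules in a custom environment could look roughly like the sketch below. It assumes Lua 5.2 or later, where load accepts an environment table as its fourth argument; the source string is a stand-in for the rule definitions, not the real rules.

```lua
-- Stand-in for the rule definitions; the real chunk would contain the LPeg rules
local source = [[
Name = "placeholder for the Name pattern"
_letter = "placeholder for a helper pattern"
]]

-- Globals assigned by the chunk land in env; reads fall back to the real globals,
-- so the chunk can still call require and other standard functions
local env = setmetatable({}, { __index = _G })
local chunk = assert(load(source, "=rules", "t", env))
chunk()

print(env.Name)            -- defined inside the environment table
print(rawget(_G, "Name"))  -- the real global table is left untouched
```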