Parsing blocks of line comments using MGrammar

How can I parse blocks of line comments with MGrammar?

I want to parse blocks of line comments. Line comments that are next to each other should be grouped together in the MGraph output.

I'm having trouble grouping blocks of line comments together. My current grammar uses "\r\n\r\n" to terminate a block, but that does not work in all cases, for example at the end of the file or once I introduce other syntaxes.

Sample input could look like this:

/// This is block
/// number one

/// This is block
/// number two

My current grammar looks like this:

module MyModule
{
    language MyLanguage
    {       
        syntax Main = CommentLineBlock*;

        token CommentContent = !(
                                 '\u000A' // New Line
                                 |'\u000D' // Carriage Return
                                 |'\u0085' // Next Line
                                 |'\u2028' // Line Separator
                                 |'\u2029' // Paragraph Separator
                                );   

        token CommentLine = "///" c:CommentContent* => c;
        syntax CommentLineBlock = (CommentLine)+ "\r\n\r\n";

        interleave Whitespace = " " | "\r" | "\n";   
    }
}

Answer by Lars Corneliussen:

The problem is that you interleave all whitespace, so by the time the tokens reach the parser, the line breaks simply "don't exist" anymore.
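To illustrate the effect, here is a rough Python simulation (not MGrammar). The regex and the `tokenize` helper are my own stand-ins for the lexer, and the character set is simplified to "\r" and "\n". Because whitespace matches are thrown away, the blank line between the two blocks leaves no trace in the token stream:

```python
import re

# Hypothetical stand-in for the lexer: CommentLine is a token, while
# spaces, "\r" and "\n" between tokens are skipped like the interleave rule.
TOKEN = re.compile(r"(?P<CommentLine>///[^\r\n]*)|(?P<Whitespace>[ \r\n]+)")

def tokenize(text):
    """Return only the CommentLine tokens, dropping interleaved whitespace."""
    tokens = []
    for match in TOKEN.finditer(text):
        if match.lastgroup == "CommentLine":
            tokens.append(match.group())
        # Whitespace matches are discarded, including the blank line.
    return tokens

source = (
    "/// This is block\r\n/// number one\r\n"
    "\r\n"
    "/// This is block\r\n/// number two"
)
tokens = tokenize(source)
print(tokens)
```

The parser only ever sees four comment-line tokens in a row, so a syntax rule has nothing left to split the blocks on.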

CommentLineBlock is a syntax rule in your case, but the comment blocks need to be consumed completely at the token level:

language MyLanguage
{       
    syntax Main = CommentLineBlock*;

    token LineBreak = '\u000D\u000A'
                         | '\u000A' // New Line
                         |'\u000D' // Carriage Return
                         |'\u0085' // Next Line
                         |'\u2028' // Line Separator
                         |'\u2029' // Paragraph Separator
                        ;  

    token CommentContent = !(
                             '\u000A' // New Line
                             |'\u000D' // Carriage Return
                             |'\u0085' // Next Line
                             |'\u2028' // Line Separator
                             |'\u2029' // Paragraph Separator
                            );   

    token CommentLine = "//" c:CommentContent*;
    token CommentLineBlock = c:(CommentLine LineBreak?)+ => Block {c};

    interleave Whitespace = " " | "\r" | "\n";   
}

The drawback is that the sub-token rules inside CommentLine are then not processed, so each block comes out as one plain string:

Main[
  [
    Block{
      "/// This is block\r\n/// number one\r\n"
    },
    Block{
      "/// This is block\r\n/// number two"
    }
  ]
]
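Until there is a nicer grammar-level solution, one workaround is to split the block strings after parsing. The following is only a sketch in Python, assuming the Block strings have already been pulled out of the MGraph output; `split_block` is a hypothetical helper, not part of any MGrammar API. It splits on the same line breaks as the LineBreak token above and strips the leading slashes:

```python
import re

# Same line breaks as the LineBreak token: CR LF first, then the single characters.
LINE_BREAK = re.compile("\r\n|[\n\r\u0085\u2028\u2029]")

def split_block(block):
    """Split one Block string into comment lines without the leading slashes."""
    lines = [line for line in LINE_BREAK.split(block) if line]
    return [line.lstrip("/").strip() for line in lines]

# Block strings copied from the output shown above.
blocks = [
    "/// This is block\r\n/// number one\r\n",
    "/// This is block\r\n/// number two",
]
parsed = [split_block(block) for block in blocks]
print(parsed)
```

This keeps the grouping that the token rule produces and recovers the individual lines in a second step.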

I might try to find a nicer way tonight :-)