How to handle inline comments within atomic rules in a Pest grammar?


I'm writing a pest grammar for the fountain.io syntax, which includes two different varieties of comment-like elements: [[ notes ]] set off with double-square brackets and /* boneyard */ elements delimited with a C-style comment syntax.

I'm trying to satisfy the test cases for Boneyards and Notes from the reference implementation, and I'm running into problems with the internal cases.

By way of example:

file = { ( boneyard | note | generic_line | blank_line)+ ~ EOI }

// Basics
ws         = _{ (" " | "\t") }
start      = _{ SOI | NEWLINE }
bn_open    =  { ("[[" | "/*") }
blank_line =  { NEWLINE }
head       =  { !(NEWLINE | ws | bn_open) ~ ANY }
tail       =  { !(NEWLINE | bn_open) ~ ANY }
text       =  { bn? ~ head ~ bn? ~ (tail ~ bn?)* }

generic_line = { start ~ text ~ &NEWLINE }

// Boneyards & Notes

bn           =  { boneyard | note }
boneyard     =  { "/*" ~ boneyard_txt ~ "*/" }
boneyard_txt = @{ (boneyard | !"*/" ~ ANY)* }
note_txt     = @{ (note | !"]]" ~ ANY)* }
note         =  { "[[" ~ note_txt ~ "]]" }

This almost does what I want, but as you can see in the pest editor, it emits a separate head or tail token for every character of text, which makes the output very noisy:

- file
  - generic_line > text
    - head: "A"
    - tail: " "
    - tail: "l"
    - tail: "i"
    - tail: "n"
    - tail: "e"
    - tail: "."
  - blank_line: "\n"
  - blank_line: "\n"
  - note > note_txt: "A note."
  - blank_line: "\n"
  - blank_line: "\n"
  - note > note_txt: "This note spans\n  multiple lines."
  - blank_line: "\n"
  - generic_line > text
    - head: "T"
    - tail: "h"
    - tail: "i"
    - tail: "s"
    - tail: " "
    - tail: "i"
    - tail: "s"
    - tail: " "
    - tail: "a"
    - tail: "n"
    - tail: " "
    - bn > note > note_txt: "internal"
    - tail: " "
    - tail: "n"
    - tail: "o"
    - tail: "t"
    - tail: "e"
    - tail: "."
  - blank_line: "\n"
  - EOI: ""

If I make the text rule atomic, i.e. text = @{ bn? ~ head ~ bn? ~ (tail ~ bn?)* }, then the results are cleaner and more readable, closer to how I'd like to actually use them:

- file
  - generic_line > text: "A line."
  - blank_line: "\n"
  - blank_line: "\n"
  - note > note_txt: "A note."
  - blank_line: "\n"
  - blank_line: "\n"
  - note > note_txt: "This note spans\n  multiple lines."
  - blank_line: "\n"
  - generic_line > text: "This is an [[internal]] note."
  - blank_line: "\n"
  - EOI: ""

But sadly that causes inline [[ internal ]] notes and boneyards to be swallowed into the generic text line instead of being reported as their own tokens. I also tried making text a compound-atomic ($) rule, but in this case the output was no different from the non-atomic rule.

Does anyone have any suggestions here? Do I have any options besides taking the character-by-character output and concatenating them all in application code?

There is 1 answer

bjmc

After a few attempts, I think I've got a result that's pretty close to what I wanted. This grammar...

file = { (boneyard | note | generic_line | blank_line)+ ~ EOI }

// Basics
ws            = _{ (" " | "\t") }
start         = _{ SOI | NEWLINE }
bn_open       =  { ("[[" | "/*") }
blank_line    =  { NEWLINE }
text_fragment =  { (!(NEWLINE | bn_open) ~ ANY)+ }
text          = _{ (bn | text_fragment)+ }

generic_line = _{ start ~ text ~ &NEWLINE }

// Boneyards & Notes

bn           = _{ boneyard | note }
boneyard     = _{ "/*" ~ boneyard_txt ~ "*/" }
boneyard_txt = @{ (boneyard | !"*/" ~ ANY)* }
note_txt     = @{ (note | !"]]" ~ ANY)* }
note         = _{ "[[" ~ note_txt ~ "]]" }

Produces output like...

- file
  - text_fragment: "A line."
  - blank_line: "\n"
  - note_txt: "A note."
  - blank_line: "\n"
  - note_txt: "This note spans\n  multiple lines."
  - blank_line: "\n"
  - text_fragment: "This is an "
  - note_txt: "internal"
  - text_fragment: " note."
  - blank_line: "\n"
  - EOI: ""

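The trick here is making text a container that dispatches between text_fragment and the boneyard/note comment types. text_fragment matches a whole run of ordinary characters, so it is reported as a single token rather than one token per character. Making the wrapper rules (text, generic_line, bn, boneyard and note) silent by prefixing them with _ then keeps the output reasonably clean.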
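For what it's worth, here is a rough sketch of how application code might consume that flat token stream. It uses only the standard library: the Token enum, the Block struct and the group_blocks and sample_tokens functions are hypothetical names made up for illustration, standing in for what you would read from pest's pairs via pair.as_rule() and pair.as_str(). The sample is a shortened token stream modeled on the output above, with EOI left out.

```rust
// Hypothetical stand-in for the tokens the grammar emits. With pest these
// would be read from each pair via `pair.as_rule()` and `pair.as_str()`.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    TextFragment(String),
    NoteTxt(String),
    BoneyardTxt(String),
    BlankLine,
}

// One run of tokens between two blank_line tokens (hypothetical type).
#[derive(Debug, Default, PartialEq)]
struct Block {
    text: String,           // visible text with notes and boneyards left out
    notes: Vec<String>,     // contents of [[ ... ]] notes
    boneyards: Vec<String>, // contents of /* ... */ boneyards
}

impl Block {
    fn is_empty(&self) -> bool {
        self.text.is_empty() && self.notes.is_empty() && self.boneyards.is_empty()
    }
}

// Split the flat stream at blank_line tokens and sort each token into the
// matching field of the current block.
fn group_blocks(tokens: &[Token]) -> Vec<Block> {
    let mut blocks = Vec::new();
    let mut current = Block::default();
    for token in tokens {
        match token {
            Token::TextFragment(s) => current.text.push_str(s),
            Token::NoteTxt(s) => current.notes.push(s.clone()),
            Token::BoneyardTxt(s) => current.boneyards.push(s.clone()),
            Token::BlankLine => {
                // Close the block only if it collected something.
                if !current.is_empty() {
                    blocks.push(std::mem::take(&mut current));
                }
            }
        }
    }
    if !current.is_empty() {
        blocks.push(current);
    }
    blocks
}

// Shortened token stream modeled on the output above (EOI omitted).
fn sample_tokens() -> Vec<Token> {
    use Token::*;
    vec![
        TextFragment("A line.".into()),
        BlankLine,
        NoteTxt("A note.".into()),
        BlankLine,
        TextFragment("This is an ".into()),
        NoteTxt("internal".into()),
        TextFragment(" note.".into()),
        BlankLine,
    ]
}
```

Calling group_blocks(&sample_tokens()) yields three blocks, the last of which holds the text around the inline note plus the note itself. One caveat: this sketch treats blank_line as the only separator. Because generic_line is silent, two text lines separated by a single newline would arrive as adjacent text_fragment tokens with nothing between them, so real code would probably want to keep generic_line visible or compare span positions.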
