I am trying to build a basic LaTeX parser using the pest library. For the moment, I only care about lines, bold format and plain text. I am struggling with the latter. To simplify the problem, I assume that it cannot contain these two chars: \ and }.
lines = { line ~ (NEWLINE ~ line)* }
line = { token* }
token = { text_bold | text_plain }
text_bold = { "\\textbf{" ~ text_plain ~ "}" }
text_plain = ${ inner ~ ("\\" | "}" | NEWLINE) }
inner = @{ char* }
char = {
!("\\" | "}" | NEWLINE) ~ ANY
}
main = {
SOI ~
lines ~
EOI
}
Using this webapp, we can see that my grammar eats the char after the plain text.
Input:
Before \textbf{middle} after.
New line
Output:
- lines > line
- token > text_plain > inner: "Before "
- token > text_plain > inner: "textbf{middle"
- token > text_plain > inner: " after."
- token > text_plain > inner: "New line"
If I replace ${ inner ~ ("\\" | "}" | NEWLINE) } by ${ inner }, it fails. If I add the & in front of the suffix, it does not work either.
How can I change my grammar so that lines and bold tags are detected?
The rule text_plain = ${ inner ~ ("\\" | "}" | NEWLINE) } certainly matches the character following inner (which must be a backslash, close brace, or newline). That's not what you want: you want the following character to be part of the next token. But it definitely seems reasonable to me to ask what happened to that character, since the token corresponding to text_plain clearly doesn't show it.

The answer, apparently, is a subtlety in how tokens are formed. According to the Pest book:
The key here, it turns out, is what is not being said.
("\\" | "}" | NEWLINE) is not a rule, and therefore it does not trigger any token pairs. So when you iterate over the tokens inside text_plain, you only see the token generated by inner.
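To make the effect visible, here is a small sketch (not something from your grammar) in which the trailing alternation is pulled out into a named rule. The name terminator is purely hypothetical, chosen for illustration. Since it is a rule, I would expect it to produce its own token pair, so the consumed character should now appear next to inner when you iterate over the children of text_plain:

```pest
// Sketch only: `terminator` is a hypothetical rule name introduced for
// illustration; the other rules are copied from the question.
text_plain = ${ inner ~ terminator }
inner      = @{ char* }
char       = { !("\\" | "}" | NEWLINE) ~ ANY }

// A named rule generates a token pair when it matches, unlike the
// anonymous group ("\\" | "}" | NEWLINE) used in the original rule.
terminator = { "\\" | "}" | NEWLINE }
```

This is only a diagnostic aid: it shows where the character goes, but it does not stop text_plain from swallowing it.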
None of that is really relevant, since text_plain should not attempt to match the following character in any event. I suppose you realised that, because you say you tried to change the rule to text_plain = { inner }, but that "failed". It would have been useful to know what "failure" meant here, but I suppose that it was because Pest complained about the attempt to use a repetition operator on a rule which can match the empty string.

Since inner is a *-repetition, it can match the empty string; defining text_plain as a copy of inner means that text_plain can also match the empty string; that means that token ({ text_bold | text_plain }) can match the empty string, and that makes token* illegal because Pest doesn't allow applying repetition operators to a nullable rule. The simplest solution is to change inner from char* to char+, which forces it to match at least one character.

In the following, I actually got rid of inner altogether, since it seems redundant: