I know Perl's "Marpa" Earley parser has very good error reporting.
But I can't find in its documentation or via Googling whether it has error recovery.
For instance, most C/C++ compilers have error recovery, which lets them report multiple syntax errors in one run, whereas many other compilers stop at the first error.
I'm actually parsing natural language and wonder if there's a way to re-sync and resume parsing after one part of the input fails.
An example, for those who want the details:
I'm parsing syllables in the Lao language. In Lao, some vowels are diacritics that are encoded as separate characters and rendered above the preceding consonant. While parsing random articles from the Lao Wikipedia, I ran into some text where such a vowel was doubled. This is not allowed in Lao orthography, so it must be a typo. But I know that within a couple of characters the text is good again.
Anyway, this is the real-world example that piqued my general interest in error recovery, or re-synchronizing with the token stream.
There are two possibilities for handling mistakes in Marpa.
“Ruby Slippers” Parsing
Marpa maintains a lot of context during scanning, so at any point it knows which tokens it expects next. We can use this context: when the parser wants a token that is not in the input, we can decide whether to offer it to Marpa anyway. Consider, for example, a programming language that requires every statement to be terminated by a semicolon. We can then use Ruby Slippers techniques to insert semicolons at specific locations, such as at the end of a line or before a closing brace.
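To make the idea concrete, here is a minimal, language-neutral sketch in Python. It does not use Marpa or its API: it is a hand-written recursive-descent parser for a toy grammar, and every name in it (`Parser`, `expect`, `ruby_slippers`, `max_fixes`) is invented for illustration. The only point is the control flow: when the parser wants a `;` that is missing, a callback decides whether to pretend the token was there.

```python
import re


class ParseError(Exception):
    """The input cannot be parsed, even with invented tokens."""


class TooManyFixes(ParseError):
    """More tokens had to be invented than the caller allows."""


class Parser:
    # Toy grammar (hypothetical, for illustration only):
    #   Block     ::= Statement+
    #   Statement ::= Body ';'
    #   Body      ::= 'statement' | '{' Block '}'

    def __init__(self, text, max_fixes=5):
        # Toy lexer: keeps known tokens and silently ignores everything else.
        self.tokens = re.findall(r"statement|[{};]|\n", text)
        self.pos = 0
        self.fixes = 0
        self.max_fixes = max_fixes

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def skip_newlines(self):
        while self.peek() == "\n":
            self.pos += 1

    def ruby_slippers(self, wanted):
        """Return True if we agree to invent the token the parser wants."""
        # Only a semicolon is ever invented, and only at the end of a line,
        # before a closing brace, or at the end of the input.
        if wanted == ";" and self.peek() in ("\n", "}", None):
            self.fixes += 1
            if self.fixes > self.max_fixes:
                raise TooManyFixes(f"invented {self.fixes} tokens, giving up")
            return True
        return False

    def expect(self, wanted):
        if self.peek() == wanted:
            self.pos += 1  # the token really is in the input
        elif not self.ruby_slippers(wanted):
            raise ParseError(
                f"expected {wanted!r} at token {self.pos}, found {self.peek()!r}"
            )
        # If ruby_slippers() agreed, nothing is consumed: the token is virtual.

    def block(self):
        statements = []
        self.skip_newlines()
        while self.peek() in ("statement", "{"):
            statements.append(self.statement())
            self.skip_newlines()
        if not statements:
            raise ParseError(f"expected a statement at token {self.pos}")
        return statements

    def statement(self):
        if self.peek() == "statement":
            self.pos += 1
            body = "statement"
        else:
            self.expect("{")
            body = self.block()
            self.expect("}")
        self.expect(";")
        return body


def parse(text, max_fixes=5):
    """Return (tree, number_of_invented_tokens) for the toy grammar."""
    parser = Parser(text, max_fixes)
    tree = parser.block()
    if parser.peek() is not None:
        raise ParseError(f"unexpected {parser.peek()!r} at token {parser.pos}")
    return tree, parser.fixes
```

Calling `parse("statement\n{ statement }\n")` succeeds even though no semicolon appears in the input, while `parse("statement statement")` still fails, because the callback refuses to invent a terminator in the middle of a line.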
In the `ruby_slippers` function, you could also count how often you needed to fudge a token. If that count exceeds some value, you could abandon the parse by throwing an error.
Skipping input
If your input may contain unparseable junk, you can try skipping it whenever no lexeme would otherwise be found. For this, the `$recce->resume` method takes an optional position argument that says where normal parsing will resume.
While the same effect could be achieved with a `:discard` lexeme that matches anything, doing the skipping in our client code allows us to abort the parse if too much fudging had to be done.
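The skip-and-resume loop with a junk budget can be sketched as follows. Again this is plain Python rather than Marpa: the `WORD` and `SPACE` patterns, `scan_with_skipping`, and `max_skipped` are all hypothetical stand-ins. Advancing `pos` by one character plays the role of resuming the recognizer at a later position, and the whitespace pattern plays the role of an ordinary discarded lexeme.

```python
import re


class TooMuchJunk(Exception):
    """More characters had to be skipped than the caller allows."""


WORD = re.compile(r"[a-z]+")   # stand-in for the grammar's real lexemes
SPACE = re.compile(r"\s+")     # stand-in for a normal discarded lexeme


def scan_with_skipping(text, max_skipped=3):
    """Return (tokens, skipped_positions), skipping unparseable characters."""
    tokens, skipped, pos = [], [], 0
    while pos < len(text):
        space = SPACE.match(text, pos)
        if space:
            pos = space.end()          # legitimate whitespace, not counted
            continue
        word = WORD.match(text, pos)
        if word:
            tokens.append(word.group())
            pos = word.end()
            continue
        # No lexeme matches here: record the position and resume one
        # character later, unless the junk budget is exhausted.
        skipped.append(pos)
        if len(skipped) > max_skipped:
            raise TooMuchJunk(f"gave up after skipping {len(skipped)} characters")
        pos += 1
    return tokens, skipped
```

Because the caller gets the skipped positions back, it can report exactly where the bad input was, which a catch-all discard rule would hide.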