Does the Marpa parser library support error recovery?

351 views Asked by At

I know Perl's "Marpa" Earley parser has very good error reporting.

But I can't find in its documentation or via Googling whether it has error recovery.

For instance, most C/C++ compilers have error recovery, which they use to report multiple syntax errors where often other compilers stop at the first error.

I'm actually parsing natural language and wonder if there's a way to re-sync and resume parsing after one part of the input fails.


Example, for those who can grok it:

I'm parsing syllables in the Lao language. In Lao some vowels are diacritics which are encoded as separate characters and rendered above the previous consonant. In parsing random articles from the Lao Wikipedia I ran into some text where such a vowel was doubled. This is not allowed in Lao orthography so must be a typo. But I know that within a couple of characters the text is good again.

Anyway this is the real example which piqued my general interest in error recovery or re-synchronizing with the token stream.

1

There are 1 answers

1
amon On

There are two possibilities for handling mistakes in Marpa.

“Ruby Slippers” Parsing

Marpa maintains a lot of context during scanning. We can use this context so that the parser can require some token, and we can decide whether we want to offer it to Marpa even if it isn't in the input. Consider for example a programming language that requires any statement to be terminated by a semicolon. We can then use Ruby Slippers techniques to insert semicolons at specific locations, such as at the end of a line, or before a closing brace:

use strict;
use warnings;
use Marpa::R2;
use Data::Dump 'dd';

my $grammar = Marpa::R2::Scanless::G->new({
    source => \q{
        :discard ~ ws
        Block         ::= Statement+ action => ::array
        Statement     ::= StatementBody (STATEMENT_TERMINATOR) action => ::first
        StatementBody ::= 'statement'       action => ::first
                      |   ('{') Block ('}') action => ::first
        STATEMENT_TERMINATOR ~ ';'
        event ruby_slippers = predicted STATEMENT_TERMINATOR
        ws ~ [\s]+
    },
});
my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });

my $input = q(
    statement;
    { statement }
    statement
    statement
);

for (
  $recce->read(\$input);
  $recce->pos < length $input;
  $recce->resume
) {
    ruby_slippers($recce, \$input);
}
ruby_slippers($recce, \$input);
dd $recce->value;

sub ruby_slippers {
    my ($recce, $input) = @_;
    my %possible_tokens_by_length;

    my @expected = @{ $recce->terminals_expected };
    for my $token (@expected) {
        pos($$input) = $recce->pos;

        if ($token eq 'STATEMENT_TERMINATOR') {
            # fudge a terminator at the end of a line, or before a closing brace
            if ($$input =~ /\G \s*? (?: $ | [}] )/smxgc) {
                push @{ $possible_tokens_by_length{0} }, [STATEMENT_TERMINATOR => ';'];
            }
        }
    }

    my $max_length = 0;
    for (keys %possible_tokens_by_length) {
        $max_length = $_ if $_ > $max_length;
    }
    if (my $longest_tokens = $possible_tokens_by_length{$max_length}) {
        for my $lexeme (@$longest_tokens) {
            $recce->lexeme_alternative(@$lexeme);
        }
        $recce->lexeme_complete($recce->pos, $max_length);

        return ruby_slippers($recce, $input);
    }
}

In the ruby_slippers function, you could also count how often you needed to fudge a token. If that count exceeds some value, you could abandon the parse by throwing an error.

Skipping input

If your input may contain unparseable junk, you can try skipping that if no lexeme would otherwise be found. For this, the $recce->resume method takes an optional position argument, where the normal parsing will resume.

use strict;
use warnings;
use Marpa::R2;
use Data::Dump 'dd';
use Try::Tiny;

my $grammar = Marpa::R2::Scanless::G->new({
    source => \q{
        :discard ~ ws
        Sentence ::= WORD+ action => ::array
        WORD ~ 'foo':i | 'bar':i | 'baz':i | 'qux':i
        ws ~ [\s]+
    },
});
my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });

my $input = '1) Foo bar: baz and qux, therefore qux (foo!) implies bar.';

try { $recce->read(\$input) };
while ($recce->pos < length $input) {
    # ruby_slippers($recce, \$input);
    try   {       $recce->resume                    }  # restart at current position
    catch { try { $recce->resume($recce->pos + 1) } }; # advance the position
    # if both fail, we go into a new iteration of the loop.
}
dd $recce->value;

While the same effect could be achieved with a :discard lexeme that matches anything, doing the skipping in our client code allows us to abort the parse if too much fudging had to be done.