Matching arbitrary delimiters

I've had good success parsing complicated and silly old text formats with Marpa before and I'm trying to do it again.

This particular format has hundreds and hundreds of different kinds of 'Begin' and 'End' blocks that look like this:

Begin BlahBlah
    asdf qwer 123
    987 xxxx
End BlahBlah

Begin FooFoo
    Begin BarBar
        some stuff (1,2,3)
    End BarBar
    whatever x
End FooFoo

How do I make a single rule that will match all of BlahBlah, BarBar, and FooFoo in the stuff above? I don't see in any of the examples how to dynamically capture the token and reuse it to terminate the rule, at least not in the standard scanless grammar examples. I don't want to enumerate all the different kinds of blocks, because new kinds would break things, and I don't think it should be necessary.

The contents of the Begin/End blocks are immaterial to the question. In reality that stuff is a complicated mess, but nothing I don't know how to slog through. I'm hand-waving over other complicating details that make Marpa a good tool for this, such that I don't want to resort to regex.

At a bare minimum all I'm trying to achieve is a key-value map of the block type (i.e. "BlahBlah") to its contents as a string.

Answer by rjt_jr

This doesn't exactly answer my original question, because I ultimately ended up ignoring the repeated string that follows the "End" token. I will probably follow the comment suggestion above and check that the begin/end names match in a post-processing step. Operating under the assumption that the token is redundant, this seems to work OK as a rough first cut. Critique welcome:

#!/usr/bin/perl
use warnings;
use strict;
use v5.18;
use utf8;
use feature 'unicode_strings';
use autodie;

use Marpa::R2;
use Data::Dumper;

my $g = Marpa::R2::Scanless::G->new({
        source         => \(<<'END_OF_SOURCE'),
lexeme default = latm => 1
:default ::= action => ::array
:start ::= beginend_blocks
:discard ~ <ws>

beginend_blocks ::= beginend_block+

beginend_block ::= beginend_block_header beginend_block_contents

beginend_block_header ::= ('Begin') beginend_block_name action => ::first

beginend_block_name ::= <word> 

beginend_block_contents ::= beginend_block_content_elems (beginend_block_terminator) (<word>)

beginend_block_content_elems ::= beginend_block_content_elem+
beginend_block_content_elem ::= word            action => ::first
                              | beginend_block  action => ::first

beginend_block_terminator ::= ('End')

<word> ~ <wordchar>+
<wordchar> ~ [\S]

<ws> ~ [\s]+

END_OF_SOURCE
});


my $test_str = <<THEDATA;
Begin BlahBlah
    asdf qwer 123
    987 xxxx
End BlahBlah

Begin FooFoo
    something else
    Begin BazBaz
        some stuff (1,2,3)
    End BazBaz
    whatever x
    Begin BarBar
        some stuff (1,2,3)
    End BarBar
    whatever y 
End FooFoo
THEDATA

MAIN: {
    my $re = Marpa::R2::Scanless::R->new({ grammar => $g, trace_terminals => 0 });

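    # The grammar defines no pause events, so read() consumes the whole input in one call.
    $re->read(\$test_str);

    # value() returns a reference to the parse result, or undef if there is no parse.
    my $value_ref = $re->value;
    die "No parse found\n" unless defined $value_ref;
    say Dumper $$value_ref;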
}
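
Since the grammar above throws away the name after 'End', the post-processing check mentioned earlier has to happen somewhere else. Here is a rough sketch of one way to do it as a separate pass over the raw text, independent of Marpa. The helper name check_begin_end is made up for illustration, and it assumes that 'Begin' and 'End' only ever appear as the first word of a line and that content lines never start with either word, which may not hold for the real format:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of Marpa: verifies that each "End NAME"
# closes the most recently opened "Begin NAME".
# Returns a list of error strings; an empty list means everything matched.
sub check_begin_end {
    my ($text) = @_;
    my @open;          # stack of [name, line number] for blocks still open
    my @errors;
    my $line_no = 0;

    for my $line (split /\n/, $text) {
        $line_no++;

        # split ' ' skips leading whitespace, so indentation does not matter.
        my ($keyword, $name) = split ' ', $line;
        next unless defined $keyword && defined $name;

        if ($keyword eq 'Begin') {
            push @open, [$name, $line_no];
        }
        elsif ($keyword eq 'End') {
            my $top = pop @open;
            if (!$top) {
                push @errors, sprintf "line %d: End %s has no matching Begin",
                    $line_no, $name;
            }
            elsif ($top->[0] ne $name) {
                push @errors, sprintf "line %d: End %s closes Begin %s from line %d",
                    $line_no, $name, $top->[0], $top->[1];
            }
        }
    }

    # Anything left on the stack was opened but never closed.
    push @errors, sprintf "line %d: Begin %s is never closed", $_->[1], $_->[0]
        for @open;

    return @errors;
}

# Example: run the check on a small nested sample.
my @errors = check_begin_end("Begin FooFoo\n    Begin BarBar\n    End BarBar\nEnd FooFoo\n");
print @errors ? join("\n", @errors) . "\n" : "all blocks balanced\n";
```

Running this on the input before handing it to the recognizer would at least catch mismatched names early, at the cost of scanning the text twice.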