Forcing gaps between words in a Marpa grammar

I'm trying to set up a grammar in which [\w] characters cannot appear directly adjacent to each other unless they are in the same lexeme. That is, words must be separated from each other by a space or punctuation.

Consider the following grammar:

use Marpa::R2; use Data::Dump;

my $grammar = Marpa::R2::Scanless::G->new({source  => \<<'END_OF_GRAMMAR'});

:start ::= Rule
Rule ::= '9' 'september'

:discard ~ whitespace
whitespace ~ [\s]+

END_OF_GRAMMAR

my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');

This parses successfully. Now I want to change the grammar to force a separation between 9 and september. I thought of doing this by introducing an unused lexeme that matches [\w]+:

use Marpa::R2; use Data::Dump;

my $grammar = Marpa::R2::Scanless::G->new({source  => \<<'END_OF_GRAMMAR'});

:start ::= Rule
Rule ::= '9' 'september'

:discard ~ whitespace
whitespace ~ [\s]+

word ~ [\w]+      ### <== Add unused lexeme to match joined keywords
END_OF_GRAMMAR

my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');

Unfortunately, this grammar fails with:

A lexeme is not accessible from the start symbol: word
Marpa::R2 exception at marpa.pl line 3.

This exception can be avoided by adding a lexeme default statement:

use Marpa::R2; use Data::Dump;

my $grammar = Marpa::R2::Scanless::G->new({source  => \<<'END_OF_GRAMMAR'});
lexeme default = action => [value]  ### <== Fix exception by adding lexeme default statement

:start ::= Rule
Rule ::= '9' 'september'

:discard ~ whitespace
whitespace ~ [\s]+

word ~ [\w]+
END_OF_GRAMMAR

my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');

This results in the following output:

Inaccessible symbol: word
Error in SLIF parse: No lexemes accepted at line 1, column 1
* String before error: 
* The error was at line 1, column 1, and at character 0x0039 '9', ...
* here: 9september
Marpa::R2 exception at marpa.pl line 16.

That is, the parse fails because there is no gap between 9 and september, which is exactly what I want to happen. The only fly in the ointment is an annoying "Inaccessible symbol: word" message on STDERR, which appears because the word lexeme is not used in the actual grammar.

I see that with Marpa::R2::Grammar I could have declared word as inaccessible_ok in the constructor options, but I can't do that with Marpa::R2::Scanless.

I also could have done something like the following:

Rule ::= nine september
nine ~ word
september ~ word

and then used a pause with custom code to examine the actual lexeme value and return the appropriate lexeme depending on that value.

What is the best way to construct a grammar that uses keywords, numbers and words, but does not allow adjacent lexemes to be run together without white space or punctuation separating them?

Answer by amon:

Well, the obvious solution is to require some whitespace in between (on the G1 level). When we use the following grammar

:default ::= action => ::array

:start ::= Rule
Rule ::= '9' (Ws) 'september'

Ws ::= [\s]+

:discard ~ whitespace
whitespace ~ [\s]+

then 9september fails, but 9 september is parsed. Important points to note:

  • A lexeme can be both discarded and required at the same position, as long as both matches are a longest token. This is why the :discard rule and the Ws rule don't interfere with each other. Marpa doesn't mind this kind of “ambiguity”.
  • The Ws symbol is enclosed in parens, which discards its value and keeps the resulting parse tree clean.
  • You do not usually want to use tricks like phantom lexemes to misguide the parser. That way lies breakage.
  • When every bit of whitespace is important, you might want to get rid of :discard ~ whitespace. The :discard rule is meant for cases such as C-like languages, where whitespace traditionally does not matter.
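
To try the grammar above against both inputs, something like the following sketch could be used. The grammar text is copied unchanged from the answer; the try_parse helper is a hypothetical name introduced here for illustration and is not part of Marpa::R2. It wraps read and value in an eval, on the assumption that a rejected input surfaces as an exception (as in the question's output) or as an undefined value.

```perl
use strict;
use warnings;
use Marpa::R2;

# The grammar from the answer, unchanged.
my $grammar = Marpa::R2::Scanless::G->new({ source => \<<'END_OF_GRAMMAR' });
:default ::= action => ::array

:start ::= Rule
Rule ::= '9' (Ws) 'september'

Ws ::= [\s]+

:discard ~ whitespace
whitespace ~ [\s]+
END_OF_GRAMMAR

# try_parse is a hypothetical helper (not a Marpa::R2 function).
# It returns 1 if the input parses and 0 otherwise.
sub try_parse {
    my ($input) = @_;
    my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });

    # read() throws when no lexeme is accepted, so trap the exception;
    # value() returns undef when there is no parse.
    my $value_ref = eval { $recce->read(\$input); $recce->value };
    return defined $value_ref ? 1 : 0;
}

for my $input ('9 september', '9september') {
    printf "%-14s %s\n", "'$input'", try_parse($input) ? 'parsed' : 'rejected';
}
```

Trapping the exception keeps the "Error in SLIF parse" message off STDERR for the rejected input. If you want to see why an input was rejected, inspect $@ after the eval.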