Grammar - How to match optional and required whitespaces before and after words?


I am using nearley and moo to build a rather complex grammar. It seems to be working fine EXCEPT for my whitespace requirements. I need to require whitespace where it is mandatory and allow it where it is optional, while keeping the grammar unambiguous.

For example:

After dinner, I went to bed.

I need to require whitespace between the words but allow it around the comma. So the following are also valid:

After dinner , I went to bed.
After dinner,I went to bed.

Below is a quick nearley grammar trying to do this. If you don't get the syntax, it's pretty easy to figure it out.

// Required whitespace
rws : [ \t]+
// Optional whitespace
ows : [ \t]*

sentence -> words %ows "," sentence
          | words

words    -> word %rws words
          | word

word     -> [a-zA-Z]:+

The grammar may have issues but the idea is the same. This becomes an ambiguous grammar. How can I define an unambiguous grammar, expecting optional and required whitespaces?

There are 2 answers

MonkeyZeus On BEST ANSWER

I'm not familiar with nearley or moo, but the regex could be

whitespace : ([ \t]*,[ \t]*|[ \t])

and your grammar would become

word %whitespace word

Hopefully that makes sense and I didn't completely botch up the language.
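
To try the idea outside nearley, here is a minimal sketch in plain JavaScript. It assumes whitespace means spaces and tabs only, and it uses `[ \t]+` instead of the single `[ \t]` so that runs of spaces between words are accepted too (that change is an assumption, not part of the regex above). The names `separator`, `sentence` and `isValid` are hypothetical and exist only for illustration.

```javascript
// Separator between two words: either a comma with optional whitespace
// around it, or a run of one or more spaces/tabs.
// Assumption: `[ \t]+` rather than a single `[ \t]`.
const separator = /(?:[ \t]*,[ \t]*|[ \t]+)/;
const word = /[a-zA-Z]+/;

// A sentence is a word, then any number of (separator word) pairs,
// then a closing period.
const sentence = new RegExp(
  `^${word.source}(?:${separator.source}${word.source})*\\.$`
);

function isValid(text) {
  return sentence.test(text);
}

console.log(isValid('After dinner, I went to bed.'));  // true
console.log(isValid('After dinner , I went to bed.')); // true
console.log(isValid('After dinner,I went to bed.'));   // true
console.log(isValid('After dinner,, I went to bed.')); // false
```

Because the comma alternative swallows the whitespace on both sides, there is only one way to read the text between two words, which is what keeps this approach unambiguous.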

customcommander On

I find that using a lexer makes my grammar simpler and I generally spend less time fixing ambiguous grammars as a result.

I'm not an expert in designing grammars but this is what I'd do:

lexer.js

  • word will match a sequence of characters
  • comma will match " , ", " ,", ", " and ",".
  • space will match a single space " "
  • period will match a single period "."
  • nl will match one or more newlines.

const moo = require('moo');

const lexer =
  moo.compile
    ( { word: /[a-zA-Z]+/
      , comma:/ ?, ?/
      , space: / /
      , period: /\./
      , nl: {match: /\n+/, lineBreaks: true}
      }
    );

module.exports = lexer;
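
To see why the comma rule sits before the space rule, here is a dependency-free sketch that mimics, in a simplified form, how the rules above are tried in order at each position. It is not moo itself: the `rules` table and the `tokenize` helper are hypothetical stand-ins, and real moo tokens carry more fields than `type` and `value`.

```javascript
// Token rules in the same order as the moo lexer above.
// The sticky flag (y) forces each match to start exactly at lastIndex.
const rules = [
  ['word', /[a-zA-Z]+/y],
  ['comma', / ?, ?/y],
  ['space', / /y],
  ['period', /\./y],
  ['nl', /\n+/y],
];

// Hypothetical helper: at each offset, the first rule that matches wins.
function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    const hit = rules.some(([type, re]) => {
      re.lastIndex = pos;
      const m = re.exec(input);
      if (!m) return false;
      tokens.push({ type, value: m[0] });
      pos += m[0].length;
      return true;
    });
    if (!hit) throw new Error(`Unexpected character at offset ${pos}`);
  }
  return tokens;
}

console.log(tokenize('lunch , I').map(t => t.type));
// [ 'word', 'comma', 'word' ]
```

The spaces around the comma never reach the grammar as separate space tokens, so the grammar does not have to decide whether a given space is optional or required.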

grammar.ne

Here we say:

  1. A text has one or more sentences
  2. Newlines can occur before and after each sentence
  3. A sentence may start with a sequence of %word tokens, each followed by either a %comma or a %space, and must finish with a %word followed by a %period.

All the post-processing rules flatten the lists of tokens and extract .value from each token, so that we end up with lists of words.

@{% const lexer = require("./lexer.js"); %}
@lexer lexer

text
  -> %nl sentence:+ {% ([_, sentences]) => sentences %}

sentence
  -> seq:* %word %period %nl {% ([seq, w, p, n]) => [...seq, w.value] %}

seq
  -> (%word %space) {% ([[w]]) => w.value %}
   | (%word %comma) {% ([[w]]) => w.value %}

This grammar can parse the following text:


After breakfast, I went to work.

After lunch , I went to my desk.

After the pub,I went home.

sleep.

Example:

const nearley = require('nearley');
// grammar.js is grammar.ne compiled with nearleyc
const grammar = require('./grammar.js');

const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));

parser.feed(`

After breakfast, I went to work.

After lunch , I went to my desk.

After the pub,I went home.

sleep.
`);

if (parser.results.length > 1) throw new Error('grammar is ambiguous');
console.log(JSON.stringify(parser.results[0], null, 2));

Output:

[
  [
    "After",
    "breakfast",
    "I",
    "went",
    "to",
    "work"
  ],
  [
    "After",
    "lunch",
    "I",
    "went",
    "to",
    "my",
    "desk"
  ],
  [
    "After",
    "the",
    "pub",
    "I",
    "went",
    "home"
  ],
  [
    "sleep"
  ]
]