multiple error reporting with menhir: which token?


I am writing a small parser with Menhir and ocamllex, and I have two requirements that I cannot seem to meet at the same time:

  • I would like to keep parsing after an error (to report more errors).
  • I would like to print the token at which the error occurred.

I can easily meet the first requirement alone, by using the error token. I can also easily meet the second alone, using the approach suggested for this question. However, I don't know of an easy way to achieve both.

The way I handle errors right now goes something like this:

pair:
| left = prodA SEPARATOR right = prodA { (* happy case *) }
| error SEPARATOR right = prodA { print_error_report $startpos;
(* would like to continue after the first error, just in case
   there is a second error, so I report both *) }

One thing that would help me is access to the lexbuf itself, so that I could get the token directly. That is, instead of $startpos I would pass something like $lexbuf. But as far as I can tell, there is no official way to access the lexbuf. The solution in 1 works only at the level of the code that calls the parser, where the caller itself passes lexbuf to the parser, but not within semantic actions.

Does anyone know whether it is actually available somehow, or of a workaround?

1 Answer

gasche

Thanks to combined work by Frédéric Bour and François Pottier, there is a new version of Menhir available that supports incremental parsing. See the announcement email sent on December 17.

The idea of this incremental API is to invert control: instead of the parser calling the lexer to process the input, you get a lower-level API in which you manipulate the parser state yourself, and feeding it a token returns an updated state (in fact it is slightly more fine-grained than that, as you can also observe internal reductions that do not require new tokens). In particular, you can observe whether the resulting parser state is an error, and choose to backtrack and provide a different input (depending on your error-recovery strategy) to get farther along in your input.
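To make the inversion of control concrete, here is a small self-contained sketch. It does not use Menhir: the engine below is a hand-written toy for an assumed grammar of `INT , INT ;` pairs, and the names `offer`, `resume`, `InputNeeded`, `HandlingError` and `Accepted` are only loosely modelled on what the incremental interface exposes, so check the Menhir manual for the real signatures. The point is the shape of the driver loop: since the caller feeds the tokens, it knows exactly which token triggered an error, and it can keep going afterwards. The recovery rule (skip to the next `;`) is an arbitrary choice made for the illustration.

```ocaml
(* Toy tokens; [Bad] stands for any lexeme the toy grammar rejects. *)
type token = Int of int | Sep | Semi | Eof | Bad of string

let show = function
  | Int n -> string_of_int n
  | Sep -> ","
  | Semi -> ";"
  | Eof -> "<eof>"
  | Bad s -> s

(* Position inside one [INT , INT ;] pair, or skipping after an error. *)
type state =
  | Start
  | AfterLeft of int
  | AfterSep of int
  | AfterRight of int * int
  | Skipping

type checkpoint =
  | InputNeeded of state * (int * int) list    (* state, pairs so far *)
  | HandlingError of token * (int * int) list  (* the offending token *)
  | Accepted of (int * int) list

(* Feed one token to the parser state; the caller owns the token stream. *)
let offer cp tok =
  match cp with
  | InputNeeded (st, acc) ->
    (match st, tok with
     | Start, Int a -> InputNeeded (AfterLeft a, acc)
     | Start, Eof -> Accepted (List.rev acc)
     | AfterLeft a, Sep -> InputNeeded (AfterSep a, acc)
     | AfterSep a, Int b -> InputNeeded (AfterRight (a, b), acc)
     | AfterRight (a, b), Semi -> InputNeeded (Start, (a, b) :: acc)
     | Skipping, Semi -> InputNeeded (Start, acc)
     | Skipping, Eof -> Accepted (List.rev acc)
     | Skipping, _ -> InputNeeded (Skipping, acc)
     | _, _ -> HandlingError (tok, acc))
  | HandlingError _ | Accepted _ -> invalid_arg "offer: no input expected"

(* Toy recovery: drop the broken pair and resynchronise on the next [;]. *)
let resume = function
  | HandlingError (Semi, acc) -> InputNeeded (Start, acc)
  | HandlingError (Eof, acc) -> Accepted (List.rev acc)
  | HandlingError (_, acc) -> InputNeeded (Skipping, acc)
  | cp -> cp

(* Driver loop: returns the pairs that parsed and every offending token. *)
let parse tokens =
  let rec loop cp tokens errors =
    match cp, tokens with
    | Accepted pairs, _ -> (pairs, List.rev errors)
    | HandlingError (tok, _), _ ->
      loop (resume cp) tokens (show tok :: errors)
    | InputNeeded _, tok :: rest -> loop (offer cp tok) rest errors
    | InputNeeded _, [] -> loop (offer cp Eof) [] errors
  in
  loop (InputNeeded (Start, [])) tokens []
```

With a real Menhir-generated parser, the same loop would presumably pull tokens from the lexbuf instead of a list, so the text of the offending token could be read with `Lexing.lexeme lexbuf` at the moment the error state is observed.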

The general idea is that this will make it possible to implement good error-recovery and error-reporting strategies on the parser-user side, and to slowly deprecate the rather inflexible "error token" mechanism.

This is already usable, but work on these features is still ongoing, and you should expect more robust support for them in further releases over the following months.