Can I collect attributes from my skipper parser?


I have a data file format which includes

  • /* comments */
  • /* nested /* comments */ too */ and
  • // c++ style single-line comments..

As usual, these comments can occur everywhere in the input file where normal white space is allowed.

Hence, rather than pollute the grammar proper with pervasive comment-handling, I have made a skipper parser which will handle white space and the various comments.

So far so good, and I am able to parse all my test cases.

In my use case, however, any of the parsed values (double, string, variable, list, ...) must carry the comments preceding it as an attribute, if one or more comments are present. That is, my AST node for double should be

struct Double {
   double value;
   std::string comment;
};

and so forth for all the values I have in the grammar.

Hence I wonder if it is possible somehow to "store" the collected comments in the skipper parser, and then have them available for building the AST nodes in the normal grammar?

The skipper which processes comments:

template<typename Iterator>
struct SkipperRules : qi::grammar<Iterator> {
    SkipperRules() : SkipperRules::base_type(skipper) {
        single_line_comment = lit("//") >> *(char_ - eol) >> (eol | eoi);
        block_comment = ((string("/*") >> *(block_comment | char_ - "*/")) >> string("*/"));
        skipper = space | single_line_comment | block_comment;
    }
    qi::rule<Iterator> skipper;
    qi::rule<Iterator, std::string()> block_comment;
    qi::rule<Iterator, std::string()> single_line_comment;
};

I can store the comments using a global variable and semantic actions in the skipper rule, but that seems wrong and probably won't play well in general with parser backtracking. What's a good way to store the comments so they are later retrievable in the main grammar?

Answer by sehe (best answer)

I can store the comments using a global variable and semantic actions in the skipper rule, but that seems wrong and probably won't play well in general with parser backtracking.

Good thinking. See the question Boost Spirit: "Semantic actions are evil"? for background. Also, in your case it would unnecessarily complicate correlating the source location with the comment.

can I collect attributes from my skipper parser?

You cannot. Skippers are implicitly wrapped in qi::omit[], so whatever attribute they synthesize is discarded (the same happens to the separator in the list operator %, by the way).

In my use case, however, any of the parsed values (double, string, variable, list, ...) must carry the comments preceding it as an attribute, if one or more comments are present. That is, my AST node for double should be

struct Double {
   double value;
   std::string comment;
};

There you have it: your comments are not comments. You need them in your AST, so you need them in the grammar.

Ideas

I have several ideas here.

  1. You could simply not use the skipper to consume the comments and match them explicitly in the grammar instead, which, as you mention, is going to be cumbersome and noisy.

  2. You could temporarily override the skipper to just be qi::space at the point where the comments are required. Something like

    value_ = qi::skip(qi::space) [ comment_ >> (string_|qi::double_|qi::int_)  ];
    

    Or given your AST, maybe a bit more verbose

    value_ = qi::skip(qi::space) [ comment_ >> (string_|double_|int_) ];
    string_ = comment_ >> qi::lexeme['"' >> *('\\' >> qi::char_ | ~qi::char_('"')) >> '"'];
    double_ = comment_ >> qi::real_parser<double, qi::strict_real_policies<double> >{};
    int_    = comment_ >> qi::int_;
    

    Notes:

    • in this case make sure the double_, string_ and int_ are declared with qi::space_type as the skipper (see Boost spirit skipper issues)
    • the comment_ rule is assumed to expose a std::string() attribute. This is fine if used in the skipper context as well, because the actual attribute will be bound to qi::unused_type which compiles down to no-ops for attribute propagation.
    • As a subtler side note I made sure to use strict real policies in the second snippet so that the double-branch won't eat integers as well.
  3. A fancy solution might be to store the collected comment(s) in some "parser state" (e.g. a member variable) and then use on_success handlers to transfer that value into the rule attribute on demand (and optionally flush comments on certain rule completions).

    I have some examples of what can be achieved using on_success for inspiration: https://stackoverflow.com/search?q=user%3A85371+on_success+qi. (Specifically look at the way position information is being added to AST nodes. There's a subtle play with fusion-adapted struct vs. members that are being set outside the control of automatic attribute propagation. A particularly nice method is to use a base-class that can be generically "detected" so AST nodes deriving from that base magically get the contextual comments added without code duplication.)

    Effectively this is a hybrid: yes you use semantic actions to "side-channel" the comment values. However, it's less unwieldy because now you can deterministically "harvest" those values in the on-success handler. If you don't prematurely reset the comments, it should even generically work well under backtracking.

    A gripe with this is that it will be slightly less transparent to reason about the mechanics of "magic comments". However, it does sit well for two reasons:

    - "magic comments" are a semantic hack whichever way you look at it, so it matches the grammar semantics in the code
    - it does succeed at removing comment noise from productions, which is effectively what the comments were for in the first place: they were embellishing the semantics without complicating the language grammar.
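Going back to option 2, its mechanics can be illustrated without Spirit. What follows is a minimal hand-written sketch, not Qi code: `Cursor`, `parse_comment` and `parse_double` are hypothetical names invented for the illustration, only block comments are handled, and joining several comments with a newline is an assumed policy. The point it tries to show is that only white space is skipped implicitly, while the comment is an ordinary part of the value production and therefore ends up in the AST node.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <optional>
#include <string>
#include <string_view>

// AST node from the question: the value plus the comment text preceding it.
struct Double {
    double value;
    std::string comment;
};

// Hypothetical input cursor (illustration only, not a Spirit type).
struct Cursor {
    std::string_view in;
    std::size_t pos = 0;

    // Only white space is skipped implicitly, like qi::skip(qi::space)[...].
    void skip_space() {
        assert(pos <= in.size());
        while (pos < in.size() && std::isspace(static_cast<unsigned char>(in[pos]))) ++pos;
    }
};

// Plays the role of comment_: matches one /* ... */ comment, nesting allowed,
// and returns the text between the outermost delimiters.
std::optional<std::string> parse_comment(Cursor& c) {
    if (c.in.substr(c.pos, 2) != "/*") return std::nullopt;
    std::size_t const body = c.pos + 2;
    std::size_t p = body;
    int depth = 1;
    while (p < c.in.size()) {
        std::string_view const two = c.in.substr(p, 2);
        if (two == "/*") { ++depth; p += 2; }
        else if (two == "*/") {
            p += 2;
            if (--depth == 0) {
                c.pos = p;
                return std::string(c.in.substr(body, p - 2 - body));
            }
        }
        else ++p;
    }
    return std::nullopt; // unterminated comment: no match, nothing consumed
}

// Rough analogue of: value_ = skip(space)[ comment_ >> double_ ]
std::optional<Double> parse_double(Cursor& c) {
    Cursor const saved = c;
    Double node{};
    c.skip_space();
    while (auto text = parse_comment(c)) {
        if (!node.comment.empty()) node.comment += '\n'; // assumed joining policy
        node.comment += *text;
        c.skip_space();
    }
    std::string const rest(c.in.substr(c.pos));
    char* end = nullptr;
    node.value = std::strtod(rest.c_str(), &end);
    if (end == rest.c_str()) { c = saved; return std::nullopt; } // backtrack
    c.pos += static_cast<std::size_t>(end - rest.c_str());
    return node;
}
```

Calling `parse_double` on a cursor over `/* obsolete */ 5.12e7` yields a node whose `comment` holds the comment text and whose `value` holds the number. On failure the cursor is restored, so nothing leaks out of a failed alternative.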
    

I think option 2 is the straightforward approach that you might not have realized was available. Option 3 is the fancy approach, in case you want the greater genericity and flexibility. E.g. what will you do with

  /*obsolete*/ /*deprecated*/ 5.12e7

Or, what about

  bla = /*this is*/ 42 /*also relevant*/;

These would be easier to deal with correctly in the 'fancy' case.
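To see why, here is a minimal sketch of the option 3 idea in plain C++ rather than Qi. All names (`Parser`, `Commented`, `pending`, `harvest`, `harvest_trailing`) are hypothetical, nested comments are left out for brevity, and attaching trailing comments to the preceding value is just one possible policy. The skipper deposits comments into parser state as a side channel, and a step that runs only after a value has parsed successfully (loosely the role an on_success handler would play) moves them into any node deriving from the `Commented` base. As a design choice of this sketch, a failed parse rolls back the buffer together with the input position, so a retried alternative does not see duplicates.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical base class: any AST node deriving from it can receive comments.
struct Commented {
    std::vector<std::string> comments;
};

struct Double : Commented {
    double value = 0;
};

// Hypothetical parser state; `pending` is the side channel filled by the skipper.
struct Parser {
    std::string_view in;
    std::size_t pos = 0;
    std::vector<std::string> pending;

    // Skipper: consumes white space and (non-nested) /* ... */ comments,
    // remembering the comment text instead of discarding it.
    void skip() {
        for (;;) {
            while (pos < in.size() && std::isspace(static_cast<unsigned char>(in[pos]))) ++pos;
            if (in.substr(pos, 2) != "/*") return;
            std::size_t const close = in.find("*/", pos + 2);
            if (close == std::string_view::npos) return; // unterminated: stop skipping
            pending.emplace_back(in.substr(pos + 2, close - pos - 2));
            pos = close + 2;
        }
    }

    // Runs only after a successful parse: moves pending comments into the node.
    void harvest(Commented& node) {
        for (auto& text : pending) node.comments.push_back(std::move(text));
        pending.clear();
    }

    // Assumed policy: comments directly after a value also belong to it.
    void harvest_trailing(Commented& node) {
        skip();
        harvest(node);
    }

    std::optional<Double> parse_double() {
        std::size_t const saved_pos = pos;
        std::size_t const saved_pending = pending.size();
        skip();
        std::string const rest(in.substr(pos));
        char* end = nullptr;
        double const v = std::strtod(rest.c_str(), &end);
        if (end == rest.c_str()) {
            // Backtrack: restore the position and drop comments collected here.
            pos = saved_pos;
            pending.resize(saved_pending);
            return std::nullopt;
        }
        pos += static_cast<std::size_t>(end - rest.c_str());
        Double node;
        node.value = v;
        harvest(node);
        return node;
    }
};
```

With this shape, both leading comments of the first example end up in `comments`, in source order, and the second example can be covered by calling `harvest_trailing` on the node before the parser moves on to the `;`. Whether a trailing comment belongs to the preceding or the following value is exactly the kind of policy decision that is easier to change here than in the grammar productions.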

So, if you want to avoid complexity, I suggest option 2. If you need the flexibility, I suggest option 3.