Best parser generator for parsing many small texts in C++?

4k views Asked by At

I am, for performance reason, porting a C# library to C++. During normal operation, this library needs, amongst other things, to parse about 150'000 math expressions (think excel formulas) with an average length of less than 150 characters.

In the C# version, I used GOLD parser to generate parsing code. It can parse all 150'000 expressions in under one second.

Because we were thinking about extending our language, I figured the move to C++ might be a good chance to change to ANTLR. I have ported the (simple) grammar to ANTLR and generated C code out of it. Parsing the 150'000 expressions takes over 12 seconds, because for each expression, I need to create a new ANTL3_INPUT_STREAM, token stream, lexer and parser - there is, at least in version 3.4, no way to reuse them.

I'd be grateful is someone could give me a recommendation what to use instead - GOLD is of course an option though generating C++ or C code seems a lot more complicated than the C# variety. My grammar is LALR and LL(1) compatible. Paramount concern is parsing performance on small inputs.

6

There are 6 answers

1
Tristram Gräbener On BEST ANSWER

I would try boost::spirit. It is often extreamly fast (even for parsing simple things like an integer it can be faster than the C function atoi http://alexott.blogspot.com/2010/01/boostspirit2-vs-atoi.html)

http://boost-spirit.com/home/

It has nice things : header only, so dependency hell, liberal licence.

However be warned that the learning curve is difficult. It's modern C++ (no pointer, but a lot of template and very frustrating compiling errors), so coming from C or C#, you might not be very comfortable.

0
Mike Dunlavey On

I've written many parsers, and hand-coded recursive-descent is the way I do it. They are easy to write and pretty much optimal.

That said, if speed is what you're after, no matter what you write there will be plenty of room to speed it up. These will be in ways that could surprise you, because anything you could think of, you would have already done.

Here's a slide set showing how to do it.

5
chill On

Instead of expr make you grammar recognize sequence-of-expr.

EDIT:

Instead of having (bison syntax):

start: expr { process_expr ($1); }
     ;

have:

start: expr_seq ;

expr_seq:   expr          { process_expr ($1); }
          | expr_seq expr { process_expr ($2); }
          ;
0
Matthieu M. On

The best performance I have seen in parsing came from Boost.Spirit.Qi which expresses the grammar in C++ using meta-template programming. It is not for the faint of heart though.

This will need be well insulated, and the compilation time of the file containing the parser will increase to several seconds (so better make sure there is as few as possible there).

0
jalf On

If the grammar to be parsed is simple, you might just write the parser by hand.

Most parser generators are designed to make it easy to whip up a working parser, and execution time often suffers as a result.

0
Basile Starynkevitch On

If the syntax of your expression is simple enough, consider making a hand-written recursive descent parser. It can run really fast, and give you the ability (with enough care) to report nicely and specifically syntax errors.

You could also use bison, but I believe a hand-written recursive parser would probably go faster.

And you can do lexing with a flex generated lexer, and do the parsing manually, in a recursive descent way.

For your information the GCC compiler has its own recursive-descent parsers for C++ & C at least. It does not use parser generators (like bison or ANTLR) anymore.