What other tools can help me create a small language targeting JVM, besides ANTLR?

278 views Asked by At

(I started my language adventure with ANTLR several days ago. My knowledge about language theory and compiler construction is very limited. Excuse me if this is not a valid question.)

ANTLR is a parser generator, and specifically, an ALL(*) parser. According to here, a parser is:

the part of a compiler that tries to make syntactic sense of the source code.

AFAIK a compiler should be composed of 5 stages:

  1. lexical analysis
  2. syntax analysis
  3. semantic analysis
  4. IL representation & optimization
  5. code generation

So ANTLR seems to cover only 1 and 2.

So if I want to write a compiler for an educational-purposed langauge which targets Java byte code on JVM. What other tools can I leverage for stages 3-5?

ADD 1

And why ANTLR just covers as far as 1 and 2? I guess 4 and 5 are skipped because they are too specific to the target platform. But why 3 is skipped by ANTLR?

2

There are 2 answers

5
Ira Baxter On BEST ANSWER

Regarding ADD1:

ANTLR does 1) and 2) because that was the goal defined for it. The authors thought you'd be happy to write the "rest" of any compiler from scratch.

I agree, one needs to go a lot further. There's a huge Life After Parsing.

If you want a tool that handles more than just parsing, you need one whose goal is correspondingly larger.

A more general class of tools is Program Transformation Systems (PTS). These tools allow you define a grammar, rather like ANTLR, and will generate a parser, will automatically build abstract syntax trees from source for that language, provide means (often "source-to-source" rewrite rules) for modifying those ASTs and finally prettyprinting the modified AST to produce equivalent source code output.

Many PTSs are limited to "one" language at a time; you can transform that langauge, which doesn't work for code generation. They often allow a hack where you can build a union grammar of two langauges, source and target, and then you can modify an AST in the source-language to make an AST in the target-language. This does allow code generation, but the union-language stunt creates a lot of confusion. For instance, if you have a "+" node, is it a "+" node in the source language or the target? You sure don't want to translate it twice.

Our DMS Software Reengineering Toolkit will handle many (that includes "two") languages at once. You can transform from the source, to the target langauge, and prettyprint the result. Because source "+" nodes are distinct from target "+" nodes, there is no confusion.

Generally PSTs only do AST manipulation. You can implement arbitrary semantic analysis by abusing rewrite rules to "rewrite" ASTs into boolean values that represent the result of a semantic predicate. This is pretty awkward.

DMS provides for semantic analysis through attribute grammars, which are way to define arbitrary analyses in terms of computations over the ASTs using the grammar rules as a guide. You can easily build symbol tables, control flow graphs and do typical type checking this way. DMS also provides means to do data flow analysis across control flow graphs.

Using the various semantic analysis, one can verify that the source program is a valid, run complex transformations that depend on information found "far away" in the source program, and provide optimizing transformations on the "target" language.

If you define your target language as an IL, you can then do source-to-IL transformation and optimization.

It isn't quite so easy to define an IL that is JVM code; after all, that's a binary representation of a virtual instruction set. With a PTS like DMS, you'd define a target language which was the surface-syntax for JVM instructions (e.g, what a JVM dump would produce), generate that, and then run a rather trivial post-processing step to convert that to actual JVM binary code. With DMS, you could implement that post-processing step as an attribute grammar computation over the AST for the JVM surface syntax target language.

[A side note: DMS can be obtained with a Java front end. This includes additional support machinery to parse and process JVM binary code. This could be used to implement that post-process-to-JVM binary step. Or, you can roll your own].

DMS's design goal as a tool is to cover the broad range of applications covering language translation ("compiling" is a special case) and program analysis. It is corresponding more ambitious, bigger than ANTLR, and corresponding more powerful.

2
Guy Coder On

Since you seem to be honing in on ANTLR I suggest that you purchase a copy of
"Language Implementation Patterns - Create Your Own Domain-Specific and General Programming Languages" and
"The Definitive ANTLR 4 Reference" by Terrence Parr

What other tools can I leverage for stages 3-5?

Terrence Parr the creator of ANTLR also created String Template which can be used for AST transformations, but there are other tools that fill this niche. See: List of program transformation systems. Note that Ira already noted DMS in his answer.

For stage 3, semantic analysis, you can use ANTLR. See: ANTLR: How to replace specific nodes in a subtree using a Tree grammar? as an example. Other ways to do semantic analysis are also discussed in the book.

For stages 4-5 read chapter 10, Building Bytecode Interpreters.
For a beginner it is a good place to start, but it will only get you started.

or

For stages 4-5 with a quick search I found this which after a quick read makes sense so I will mention it, but no guarantees. In short is uses Javac for phases 4-5. Since the blog is hosted on Oracle, I take it that only the Oracle Javac will work.

A quite interesting approach is to construct AST nodes representing the structure of the java code, then generate byte code from that. Actually, this is what the javac does.

Generating java byte code by building AST trees

Why ANTLR just covers as far as 1 and 2?

From Tree rewriting in ANTLR v4

To quote Terrence Parr

Because most ANTLR users don't build compilers, I decided to focus on the other applications for ANTLR v4: parsing and extracting information and then translations.

So for stages 1,2 and 3 you would use ANTLR,
for AST Transformations if necessary you use String Templates and
for stages 4 and 5 you can use Javac.

This is just a start, you have a long journey with lots of research to do. I suggest you take copious amounts of notes along the way.