What are the stages of compilation of a C++ program?


Are the stages of compilation of a C++ program specified by the standard?

If so, what are they?

If not, an answer for a widely-used compiler (I'd prefer MSVS) would be great.

I'm talking about preprocessing, tokenization, parsing and such. What is the order in which they are executed and what do they do in particular?

EDIT: I know what compilation, linking and preprocessing do; I'm mostly interested in the other stages and their order. Explanations for these are, of course, also welcome, since I might not be the only one interested in an answer.


There are 3 answers

Keith Thompson (best answer)

Are the stages of compilation of a C++ program specified by the standard?

Yes and no.

The C++ standard defines 9 "phases of translation". Quoting from the N3242 draft (10MB PDF), dated 2011-02-28 (prior to the release of the official C++11 standard), section 2.2:

The precedence among the syntax rules of translation is specified by the following phases [see footnote].

  1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. [SNIP]
  2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. [SNIP]
  3. The source file is decomposed into preprocessing tokens (2.5) and sequences of white-space characters (including comments). [SNIP]
  4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. [SNIP]
  5. Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set; [SNIP]
  6. Adjacent string literal tokens are concatenated.
  7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token (2.7). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit. [SNIP]
  8. Translated translation units and instantiation units are combined as follows: [SNIP]
  9. All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

[footnote] Implementations must behave as if these separate phases occur, although in practice different phases might be folded together.

As indicated by the [SNIP] markers, I haven't quoted the entire section, just enough to get the idea across.

To emphasize, compilers are not required to follow this exact model, as long as the final result is as if they did.

Phases 1-6 correspond more or less to the preprocessor, 7 to what you might normally think of as compilation, 8 deals with templates, and 9 corresponds to linking.

(C's translation phases are similar, but #8 is omitted.)

Steve Jessop

The 9 so-called "phases of translation" are listed in the standard in [lex.phases] (2.2 in C++11, 2.1 in C++03).

The detail demanded in the standard varies: preprocessing is split up into several phases, because it's important at various points in the standard exactly what has "already been done" and what is "left to do" when a particular bit of behavior is defined. So although it doesn't tell you how to write a lexer, it gives you a pretty clear roadmap.

Linking on the other hand is left mostly to the implementation to decide how it's actually achieved, because the standard doesn't care how a given name is looked up, just what it refers to.

It doesn't give any detail on parsing, either; it just says "The resulting tokens are syntactically and semantically analyzed and translated". That's because the whole of chapters 3-15 is required to fill in that detail.

It doesn't mention internal representations during parsing/translation at all, and neither does it mention optimization phases -- they're important to the design of compilers, but they're not important to the standard. Optimization can occur in different places in different compilers. For a long time, optimization was almost entirely in the compilation phase, before emitting object files, and linkers were dumb as a post. I think now serious C++ implementations can all do at least some optimization across multiple TUs. So "the others" aren't just left out of the standard, they do actually change over time.

Liam M

The C++ specification is intentionally vague in many respects, mostly to remain implementation independent. A lot of the areas where the language is vague aren't a large concern anymore - for example, you can usually rely on a char being 8 bits. However, other issues, such as the layout of structures that use multiple inheritance, are a real concern, as are the implications of virtual functions on classes. These issues affect the compatibility of code generated by different compilers. The Application Binary Interface (or ABI) of C++ isn't rigorously defined, and as a result you occasionally have to dip into C where this becomes problematic. Writing a plugin interface is a good example.

Similarly, the standard doesn't give a detailed description of how a compiler should be built, because there are many key decisions and features that differentiate compilers. For example, MSVC can perform partial builds (allowing edit and continue), which GCC doesn't. Generally speaking, though, all compilers perform similar stages: preprocessing, syntax parsing, determining program flow, producing a symbol table, and producing a linear series of instructions which can subsequently be linked to produce an executable. Oh, and linking those object files is usually done by a linker.

I had a brief look; it's rather hard to find descriptions of individual compilers. I doubt there's much out there on commercial compilers like Microsoft's offering, purely for commercial reasons. GCC is your best bet, although Microsoft is happy to describe the process. This is pretty banal stuff though: compilers all work pretty much the same way. The real gold is in how they execute these stages, the algorithms and data structures they use. In that respect, I recommend this book. I bought a brand new copy for a university course a few years back, and I borrowed most of my textbooks from the library :).