Transpiling/code generation - declaration of variables issue


I have recently been working with ANTLR and Java, and I built a simple grammar that parses the code below and generates an AST. I also wrote a built-in interpreter to execute the code, and it seems to work well.

Some notes on my toy language:

  • My language has only one variable type, "double".
  • All variables are implicitly declared on assignment.
  • All variables have global scope, i.e. I can use a variable after it is assigned, even outside the block in which it is assigned.

/* A sample program */
BEGIN
    j := 1;
    WHILE j <= 5 DO
        PRINT "ITERATION NO: "; PRINTLN j;
        sumA1 := 0;
        WHILE 1 = 1 DO 
            PRINT "Enter a number, 0 to quit: ";
            i := INPUT;
            IF i = 0 THEN
                BREAK;
            ENDIF
            sumA1 := ADD sumA1, i;
        ENDWHILE
        j := ADD j, 1;
        PRINT "The sum is: "; PRINTLN sumA1;
    ENDWHILE
    j := MINUS j;
    PRINTLN j;
END

I then added code generation functions to my AST classes to output C, and I get this result (beautified):

#include <stdio.h>

#include <stdlib.h>

int main(int argc, char * argv[]) {
  double j;
  j = 1.00000;
  while (j <= 5.0) {
    printf("ITERATION NO: ");
    printf("%g\n", j);
    double sumA1;
    sumA1 = 0.00000;
    while (1.0 == 1.0) {
      printf("Enter a number, 0 to quit: ");
      double i;
      scanf("%lf", & i);
      if (i == 0.0) {
        break;
      }
      sumA1 = sumA1 + i;
    }
    j = j + 1.00000;
    printf("The sum is: ");
    printf("%g\n", sumA1);
  }
  j = -j;
  printf("%g\n", j);
}

During code generation, I first check whether the variable name is already in the HashMap. For assignment/input statements, I emit the variable declaration just before the assignment, as you can see. For any other use of a variable that is not in the HashMap, I throw an exception for use of a variable before it is initialized.

All well and good. The above code works for this example, since in my source program I am not using any variable outside the scope in which it is declared.

But there is one issue. Since I am declaring certain variables inside blocks (such as a while body), they cannot be used outside that scope in C. I need a way to collect all the variables used in my source program and declare them as globals in C (or at least at the top of the main() function). Declaring variables just before their first assignment will cause valid programs in the source language to fail to compile in C whenever a variable is used outside the block in which it was first assigned.

I thought I could solve it by first resolving all the variables, declaring them at the start of the C program, and then generating the code.

But if I update the symbol table (HashMap) before generating the code, I won't have a way to know if the variable is actually assigned before usage.

What is the best way to re-design this to ensure that:

  • the code generator checks for assignment before use, i.e. if it finds a use before assignment, it throws an exception/compilation error.
  • at the same time, all variables in my code are available as globals in the generated C source, so that using a variable outside a block is possible if it was assigned earlier in an inner block, since that is acceptable in my source language.

It is the first time I am attempting something like this. Please provide me pointers to any possible solution.

There are 2 answers

Answer by rici (best answer)

In the general case, detecting use before assignment is impossible. Consider the following (not very good) C code:

int sum;          /* uninitialised */
for (i = 0; i < n; ++i) {
  if (check(i)) sum = 0;
  sum += val[i];  /* Is sum initialised here? */
  process(sum);
}

If check(i) is, say, i % 10 == 0, then sum will certainly be initialised. But if it is i % 10 == 1, then sum is used uninitialised in the first iteration. In general, whether sum is used uninitialised depends on the value of check(0). But there may be no way to know what that is. check() might be an external function. Or its return value might be dependent on input. Or it might be based on a difficult computation.

That doesn't mean you shouldn't try to detect the problem. You could use symbolic execution, for example, to try to compute a conservative estimate of undefined use. You could throw an exception if you could prove undefined use, and issue a warning if you can't prove that all uses are defined. (Many compilers use a variant of this technique.) That might be an interesting exercise in control-flow analysis.
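As a rough illustration of such a conservative estimate, here is a minimal Java sketch. It is not symbolic execution, only a simple flow-insensitive approximation, and every name in it (DefiniteAssignment, Stmt, Kind, check) is hypothetical rather than taken from the asker's project. It assumes that every IF or WHILE body may be skipped, so assignments made inside a body do not count once the body ends. It reports each use that cannot be proven assigned, which corresponds to the "warning" case above; it does not try to prove that a use is definitely undefined.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names come from the original project.
final class DefiniteAssignment {
    enum Kind { ASSIGN, USE, MAYBE }

    // Flattened stand-in for an AST node. An expression is modelled as a
    // sequence of USE nodes; an IF or WHILE body is a MAYBE node.
    static final class Stmt {
        final Kind kind;
        final String name;      // variable name, or null for MAYBE
        final List<Stmt> body;  // nested statements, empty unless MAYBE

        private Stmt(Kind kind, String name, List<Stmt> body) {
            this.kind = kind;
            this.name = name;
            this.body = body;
        }
        static Stmt assign(String name) { return new Stmt(Kind.ASSIGN, name, new ArrayList<Stmt>()); }
        static Stmt use(String name)    { return new Stmt(Kind.USE, name, new ArrayList<Stmt>()); }
        static Stmt maybe(Stmt... body) { return new Stmt(Kind.MAYBE, null, Arrays.asList(body)); }
    }

    // Returns every variable read at a point where some path may not have assigned it.
    static List<String> check(List<Stmt> program) {
        List<String> suspects = new ArrayList<String>();
        walk(program, new HashSet<String>(), suspects);
        return suspects;
    }

    private static void walk(List<Stmt> stmts, Set<String> assigned, List<String> suspects) {
        for (Stmt s : stmts) {
            switch (s.kind) {
                case ASSIGN:
                    assigned.add(s.name);
                    break;
                case USE:
                    if (!assigned.contains(s.name)) suspects.add(s.name);
                    break;
                case MAYBE:
                    // The body sees everything assigned so far, but because it may be
                    // skipped, its own assignments are discarded afterwards.
                    walk(s.body, new HashSet<String>(assigned), suspects);
                    break;
            }
        }
    }

    public static void main(String[] args) {
        // i := INPUT; IF i < 10 THEN j := MUL i, 10; ENDIF PRINT j;
        List<Stmt> program = Arrays.asList(
            Stmt.assign("i"),
            Stmt.use("i"),
            Stmt.maybe(Stmt.use("i"), Stmt.assign("j")),
            Stmt.use("j"));
        System.out.println(check(program)); // prints [j]
    }
}
```

Being conservative, this also flags programs that are fine at run time, for example a variable first assigned inside a WHILE 1 = 1 loop and read after it. A more precise analysis would need to reason about conditions and BREAK, which is where the control-flow work comes in.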

But for a real-world solution, given that all variables are numeric, I'd suggest just automatically initialising all variables to 0, as part of the language semantics.

Answer by hari

I have kind of solved it by removing the variable declarations from the individual assignment nodes, adding each assigned variable to the global hashmap instead, and generating the declarations after walking the whole tree.

It kind of works like this:

  • Walk the AST. If I come across a use of a variable (other than in an assignment/input statement) that is not yet in the global hashmap, generate an exception/error.
  • If I come across an assignment/input statement, add the variable to the global hashmap, but don't declare it in the code generated for that particular node.
  • After generating the entire code, run through the global hashmap and generate the declarations.
  • Put together the main program by concatenating the declaration statements and the generated code.
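The steps above could be sketched in Java roughly as follows. This is only an illustration under assumptions: the class HoistingEmitter and its methods emitAssign, emitPrintln, emitRaw and finish are hypothetical names, not the actual generator, and a LinkedHashSet stands in for the global hashmap so that declarations come out in order of first assignment.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of a generator that hoists declarations to the top of main().
final class HoistingEmitter {
    // Variables seen in assignment/input statements, in order of first assignment.
    private final Set<String> declared = new LinkedHashSet<String>();
    // C statements generated so far, without any declarations.
    private final StringBuilder body = new StringBuilder();

    // Assignment node: record the variable, emit only the assignment itself.
    void emitAssign(String name, String cExpr) {
        declared.add(name);
        body.append("  ").append(name).append(" = ").append(cExpr).append(";\n");
    }

    // Use node: reject variables that no earlier statement has assigned.
    void emitPrintln(String name) {
        if (!declared.contains(name)) {
            throw new IllegalStateException("variable used before assignment: " + name);
        }
        body.append("  printf(\"%g\\n\", ").append(name).append(");\n");
    }

    // Anything else (block openers and closers, for instance) is passed through.
    void emitRaw(String cLine) {
        body.append("  ").append(cLine).append("\n");
    }

    // Concatenate the declarations and the generated statements.
    String finish() {
        StringBuilder out = new StringBuilder("#include <stdio.h>\n\nint main(void) {\n");
        for (String name : declared) {
            out.append("  double ").append(name).append(";\n");
        }
        out.append(body).append("  return 0;\n}\n");
        return out.toString();
    }

    public static void main(String[] args) {
        HoistingEmitter e = new HoistingEmitter();
        e.emitAssign("i", "1.0");
        e.emitRaw("if (i < 10.0) {");
        e.emitAssign("j", "i * 10.0");
        e.emitRaw("}");
        e.emitPrintln("j");
        System.out.print(e.finish());
    }
}
```

Because the check in emitPrintln only looks at source order, it accepts a variable that was assigned inside a block that may never run.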

But I realized that this can lead to problems when a variable is initialized inside an IF block and used outside it. If the program executes the IF block, there is no problem. If the IF block is skipped, my interpreter throws an exception, but the C code generation still succeeds, and the resulting C program then reads an uninitialized variable.

Take for example (in my code)

BEGIN
   i := INPUT;
   IF i < 10 THEN
      j := MUL i, 10;
   ENDIF
   PRINT j;
END

which spits out this C code (prettified)

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
   double i;
   double j;
   scanf("%lf", &i);
   if (i < 10.0)
   {
      j = i *10.0000;
   }
   printf("%g\n", j);
}

In such a scenario, my built-in interpreter throws an exception when the IF block is not executed, i.e. when i >= 10, since j is left unassigned. The equivalent C code, however, is generated and compiles without errors, but j is then read while uninitialized, which is undefined behaviour at run time.

But for now, I think I can accept this behaviour, since in any case the use of a potentially uninitialized variable is a problem with the design of the program itself.

I suppose the other alternative is to give every variable an implicit NULL (or NaN) initial value, meaning "uninitialized", and check for that value before each use.
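A minimal Java sketch of what the NaN variant might emit is shown below. The class NanSentinel and its methods declare and guard are hypothetical names. The sketch assumes the generated C file includes <math.h> (for the NAN and isnan macros) in addition to <stdio.h> and <stdlib.h>, and that the generator places the guard before every read of a variable.

```java
// Hypothetical helper that produces C text for NaN-based "unassigned" checks.
final class NanSentinel {
    // Declaration hoisted to the top of main(): the variable starts as NaN.
    static String declare(String name) {
        return "double " + name + " = NAN;";
    }

    // Runtime check to emit before a read of the variable.
    static String guard(String name) {
        return "if (isnan(" + name + ")) { fprintf(stderr, \"variable " + name
             + " used before assignment\\n\"); exit(1); }";
    }

    public static void main(String[] args) {
        System.out.println(declare("j"));
        System.out.println(guard("j"));
        System.out.println("printf(\"%g\\n\", j);");
    }
}
```

One caveat with this design is that a NaN produced by a legitimate computation (0.0 / 0.0, for example) would be indistinguishable from "never assigned". A separate boolean flag per variable would avoid that ambiguity at the cost of more generated code.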