Internal Compiler Data Structures
Parse Tree: The output of the syntax analysis. It represents the syntax structure of the program.
Abstract Syntax Tree: A modified parse tree. In most compilers, it shares the same structure as the original parse tree but describes essential program structure.
Symbol Tables: Data structures to maintain semantic and code generation information for variable/function/type identifiers.
Intermediate Representation(IR): Architecture independent code that is close to assemblies.
Parse Tree | AST |
Abstract Syntax Tree
Abstract Syntax Tress(ASTs) are designed to capture all the essential structural information about a program being compiled
The basic process for building an AST is very simple:
-
The leaf nodes are typically constants and identifiers. Build a node to represent each of these entities.
-
Interior nodes are build as needed to represent language constructs. Typically, interior nodes contain links to one or more child nodes.
The processing of AST nodes for semantic analysis is also simple:
-
Process the AST depth first. This guarantees that child nodes are processed before the parent nodes.
-
At each parent node do any processing required using information from already processed child nodes.
Generic ASTNode Class in MiniC Compiler
1
2
3
4
5
6
7
8
9
10
struct SourceLocation{
int Line, Row;
}
class ASTNode {
std::vector<ASTNode*> Children;
ASTNode* Parent;
SourceLocation SrcLoc;
Program* Root;
}
All other AST classes in MiniC Compiler inherit this base class.
Children nodes will have different semantic meaning based on the specific AST class type.
IfStatement Class in MiniC Compiler
1
2
3
4
5
6
7
8
9
10
11
12
13
class Statement : public ASTNode{};
class IfStatement : public Statement {
public:
Expr* condExpr() {
return (Expr*) getChild(0);}
Statement* thenStmt() {
return (Statement*) getChild(1);}
bool hasElse(){
return numChildren() > 2;}
Statement* elseStmt() {
assert (numChildren() > 2);
return (Statement*) getChild(2);}
};
The children are the conditional expression, the then body statement, and the else body statement, respectively.
Expr, Statement, and IfStatement are all sub-class of ASTNode.
CallExpr Class in MiniC Compiler
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
class Expr : public ASTNode {
Type ExprType;}
class CallExpr : public Expr {
public:
Identifier* callee(){
return (Identifier*) getChild(0);
}
size_t numArgs(){
assert(numChildren() > 0);
return numChildren() - 1;
}
Expr *arg(size_t i){
assert (i + 1 < numChildren());
return (Expr*)getChild(i + 1);
}
};
The first child is the callee(The calling function).
The follow-up children are expressions that correspond to the passed arguments in order.
Compiler Tables
The Symbol Table contains information about declared things(usually identifiers).
A Type Table is used to record information about builtin and user declared types.
Efficient table management is often a major performance issue in the design of a compiler.
Table lookup must follow the languages scope rules. Table usage:
-
Declaration processing - add entries for identifiers and types being declared
-
Semantic analysis - look up identifiers to determine their attributes. Look up types to validate program usage.
-
Code generation - look up identifiers to determine their attributes.
Table Organization & Table Storage
Possible table organizations
-
One global table for the entire program.
-
Separate linked tables for each major scope.
Table storage
-
Fixed size memory resident, limits program size
-
Statically or dynamically allocated in memory
0 Partially memory resident, remained on disk. LRU caching works well.
Symbol Table
Typical Symbol Table Entries
A symbol table entry contains the attributes of symbols that might be different for each symbol
A typical symbol table entry might contain:
-
Name of the item
-
Kind of item, e.g. constant, variable, type, procedure, function, etc.
-
Type of the item (index into type table)
-
Attributes or properties associated with the item
-
Size of the item
-
Run time address for the item
-
Value for the item
-
Links to related symbols, e.g. parameter lists, enumerated constants, fields in a struct or union
Typical Symbol Table Operations
The following operations are typical of a symbol table class/module
-
Create Symbol Table
-
enter new scope
-
Exit scope
-
Lookup symbol in current scope
-
Lookup symbol using scope rule
-
Enter the new symbol in current scope
-
Enter the new symbol in designated scope
-
Delete the symbol from table
-
Retain symbol in symbol table
-
Get or set fields in the symbol table
Type Table
Generic Type Table Entry
A type table entry contains the information that might be different for each distinct type.
A typical type table entry might contain:
-
Name of the item (often omitted)
-
Kind of type, e.g., typedef, enum, scalar, array, struct, union, etc.
-
Attributes or properties associated with the type, e.g, const, static, etc.
-
Size and memory alignment for objects of this type
-
Actual definition of the type, usually involves links to embedded components.
-
Links to related definitions.
-
For many modern languages, the type table is a directed acyclic graph.
- Example: array type
- Number of dimensions, size/alignment, bound of each dimension
array dim=3 size=16000 align=4 bounds element - Example: struct
- Number of fields, size/alignment, link to entries for fields
record nFields=23 size=1440 align=8 fields - Example: function
- Number of parameters, entry point, links to return type and parameters
funcproc nParam=3 entryAt 0X00640 return type parameters
Handling Values
The compiler needs to be able to represent in its tables any kind of value, i.e. constant, that could occur in a program.
To conserve table space, some form of union data structure is often used:
1
2
3
4
5
6
7
8
9
10
struct constantDesc {
shoft constantKind ;
union {
int intValue;
float floatValue;
char charValue;
char *stringValue;
void *bigValue;
} constantValue;
}
Large constant values, e.g. initialization for arrays or structs often require special storage.
Scopes and Declarations
Assume that a program consists of some number of nested scopes of declarations(e.g., main program, begin - end blocks, functions and procedures).
Identifiers are declared in one or more scopes.
The scope rule for the programming language determines the visibility of symbols declared in one scope in other scopes.
The most common scope rule is to allow identifiers declared in a scope to be automatically visible in all contained (properly nested) scopes.
Major and Minor Scopes
Major Scope: A major scope corresponds to a construct with special significance in the programming language.
-
Examples: the main program, body of a class, body of a routine
-
A major scope has significance beyond symbol table lookup, it is also usually a unit for storage allocation.
Minor Scope: A minor scope is a small scope of less significance that occurs within a major scope.
- Examples: scopes created within statements using {}.
Recommended Strategy:
-
Merge the symbol table entries for minor scopes into the symbol table for the nearest enclosing major scope.
-
Visibility of identifiers declared in minor scopes is controlled by the lookup algorithm
Scope Rules and Tables
In most programming languages a program is partitioned into some (possibly overlapping) scopes of declaration.
The scope rule for a language specifies the algorithm that must be used to search compiler tables.
To implement most scope rules, each scope of declaration is logically a separate set of symbol table entries.
For the most common scope rule, scopes are properly nested, and the symbol table behaves in a strictly stack-like fashion.
Most compilers distinguish symbol visibility from symbol storage. The scope rule determines which symbols are visible at any point in the program.
Retained Symbols
The formal parameters of functions and procedures need special symbol table handling.
The symbol table entries are created when the prototype or actual definition of the routine is processed.
The entries are used during processing the body of the routine.
After the routine has been processed, the symbol table entries for the formal parameters of the routine need to be retained in the symbol table, so they can be used to process calls of the routine.
The usual technique is to leave the formal parameter entries in the symbol table linked to the routine entry.
The parameters won’t be visible to ordinary symbol table lookup.
Symbol Table Performance Considerations
Hash table is the most common choice for implementing the symbol table. O(1) insertion and look ups.
For complicated compilers, map each identifier string into an index integer. Then use index integers as symbol table keys.
Look Up in Scopes: Iteratively fetch the parent scope until if finds the target symbol. Or it returns null when reaching and not finding in the root program scope.
Cache Look Up: It might be a good idea to cache the look up results locally so that it can return the results faster when looking for the same variable again.
MiniC Symbol Table Example
1
2
3
4
5
6
7
struct VarSymbolEntry {
Type VarType;
llvm:: Value *LLVMValue;
};
class VarSymbolTable {
std::map<std;:string, VarSymbolEntry> Table;
}
Variable symbol table maps each variable to its type, and the corresponding Llvm IR object (used during code generation).
It contains methods to get/set symbol table entries for new variables.
1
2
3
4
5
6
7
8
struct FuncSymbolEntry {
Type RetyrnType;
std::vector<Type> ParameterTypes;
bool HasBody;
};
class FuncSymbolTable {
std::map<std::string, FuncSymbolEntry> Table;
};
Function symbol table maps each function name to its return type, and the list of parameter types. It also contains a flag to indicate whether we have seen the definition (function body) yet.
It contains methods to get/set symbol table entries for new functions.
How To Handle Symbol Table For MiniC
MiniC does not allow forward reference. This means that we can construct the symbol table when we visit the AST tree to perform the semantic analysis.
When encounter a valid variable declaration, add the variable with its type information into the variable symbol table.
When encounter a valid function declaration, add the function with its signature type information into the function symbol table.
When checking the semantic correctness of the program, i.e., whether the type of two expressions match, you will query the symbol table accordingly.