Compiler Optimization - Introduction

Purpose of compilers

Translate one language into another
- e.g., convert C++ into x86 object code
Improve (i.e. “optimize”) the code
- e.g., make the code run 3 times faster
  - or more energy efficient, more robust, etc.

How can the compiler improve performance

Execution time = Operation count * Machine cycles per operation

Minimize the number of operations
- Arithmetic operations, memory accesses
Replace expensive operations with simpler ones
- e.g., replace 4-cycle multiplication with 1-cycle shift
Minimize cache misses
- both data and instruction accesses
Perform work in parallel
- Instruction scheduling within a thread
- Parallel execution across multiple threads

Optimization Formalization

Formulate optimization problem
- Identify opportunities of optimization (problem domain of the optimization)
  - Applicable across many programs
  - Affect key parts of the program (loops/recursions)
  - Amenable to “efficient enough” algorithm
Representation
- Must abstract essential details relevant to optimization
Analysis
- Detect when it is desirable and safe to apply transformation
Code Transformation

Representation: Instructions

Three-address code

A := B op C
LHS: name of variable e.g., x, A[t] (address of A + contents of t)
RHS: value
There is at most one operator on the right side of an instruction
- t2 = a + b + 3 will be replaced to
  - t1 = a + b
  - t2 = t1 + 3
Each 3AC contains at most 3 addresses
- Address can be one of the following
  - Names: a, b
  - Constant: 3
  - Compiler-generated temporary: t1, t2

Examples

code	operator
x = y bop z	bop: binary arithmetic or logical operation
x = uop z	uop: unary operation (minus, negation, casting)
x = y
goto L	goto L: unconditional jump to a program location L
if x goto L:	if … goto L: conditional jump
if x ropy goto L:	rop: relational operator (>, <, ==, >=, <=, etc.)

Optimization Example

Bubblesort program that sorts an array A that is allocated in static storage:

an element of A uses four bytes of a byte-addressed machine
elements of A are number 1 through n (n is a variable)
A[j] is in location &A + 4 * (j - 1)

FOR i := n-1 DOWNTO 1 DO
    FOR j := 1 TO i DO
        IF A[j] > A[j + 1] THEN BEGIN
            temp := A[j];
            A[j] := A[j+1];
            A[j+1] := temp;
        END

Translated code

    i := n - 1
S5: if i < 1 goto s1 ; boundary condition for outer loop
    j := 1
S4: if j > i goto s2 ; boundary condition for inner loop
    t1 := j - 1
    t2 := 4 * t1
    t3 := A[t2] ; Address for A[j]
    t4 := j + 1
    t5 := t4 - 1
    t6 := 4 * t5
    t7 := A[t6] ; Address for A[j + 1]
    if t3 <= t7 goto s3 ; Exit condition for comparision
    t8 := j - 1
    t9 := 4 * t8
    temp := A[t9] ; Address temp with A[j]
    t10 := j + 1
    t11 := t10 - 1
    t12 := 4 * t11
    t13 := A[t12] ; Address for A[j + 1]
    t14 := j - 1
    t15 := 4 * t14
    A[t15] := t13 ; Assign A[j] with A[j + 1]
    t16 := j + 1
    t17 := t16 - 1
    t18 := 4 * t17
    A[t18] := temp ; Assign A[j + 1] with temp / old value of A[j]
S3: j := j + 1
    goto S4 ; Start new iteration of the inner loop
S2: i := i - 1
    goto S5 ; Start new iteration of the outer loop
S1:

Representation: Basic Block

Basic Block: Maximal sequences of consecutive three-address instructions with the properties
- Only the first statement can be reached from outside the block (no branches into middle of block)
- It can be exited only at the end

Algorithm for building Basic Block

  INPUT: A sequence of three-address instructions of P
  OUTPUT: A list of basic blocks of P
  METHOD: (1) Determine the leads in  P
          - The first instruction in P is a leader
          - Any target instruction of a conditional / unconditional jump is a leader
          - Any instruction that immediatedly follows a jump is a leader
            
          (2) Build BBs for P
          - A BB consists of a leader and all its subsequent instructions 
          - until the next leader

Example

  x = input
  y = x - 1
  z = x * y
  if z < x goto (7)
  p = x / y
  q = p + y
  a = q
  b = x + a
  c = 2a - b
  if p == q goto (12)
  goto (3)
  return

Determine leaders (line number):
- The first line: 1
- The target of jump: 3, 7, 12
- Line follow after the jump: 5, 11, 12
- 1, 3, 5, 7, 11, 12
Build BBs for P:
- B1: (1, 2)
- B2: (3, 4)
- B3: (5, 6)
- B4: (7, 8, 9, 10)
- B5: (11)
- B6: (12)

Flow Graphs

Nodes: Basic Blocks
Edges: Block A to B if and only if
- There is a conditional or unconditional jump from the end of A to the beginning of B
- B immediately follows A in the original order of instuctions and A does not end in anunconditional jump

Sources of Optimizations

Algorithm optimization
algebraic optimization
- A := B + 0 => A := B
Local optimizations
- within a basic block – across instructions
Global optimizations
- within a flow graph – across basic blocks
Interprocedural analysis
- within a program – across procedures (flow graphs)

Local Optimizations

Analysis & transformation performed within a basic block
No control flow information is considered
Examples of local optimizations:
- local common subexpression elimination
  - analysis: same expression evaluated more than once in b
  - transform: replace with single calculation
- local constant folding or elimination
  - analysis: expression can be evaluated at compile time
  - transform: replace by constant, compile-time value
- dead code elimination

Example

  B1: i := n - 1
  B2: if i < 1 goto out ; boundary condition for outer loop
  B3: j := 1
  B4: if j > i goto B5 ; boundary condition for inner loop
  B6: t1 := j - 1
      t2 := 4 * t1
      t3 := A[t2] ; Address for A[j]
      -------------
      |t4 := j + 1 |
      |t5 := t4 - 1| ; t5 can be replaced as j
      |t6 := 4 * t5| ; t6 can be replaced as 4 * j
      -------------
      t7 := A[t6] ; Address for A[j + 1]
      if t3 <= t7 goto B8 ; Exit condition for comparision
  B7: t8 := j - 1
      t9 := 4 * t8
      temp := A[t9] ; Address temp with A[j]
      ---------------
      |t10 := j + 1  |
      |t11 := t10 - 1| ; t11 can be replaced as j
      |t12 := 4 * t11| ; t12 can be replace as 4 * j
      ---------------
      t13 := A[t12] ; Address for A[j + 1]
      --------------
      |t14 := j - 1  | ; t14 is literally same as t8
      |t15 := 4 * t14| ; t15 is literally same as t9
      |A[t15] := t13 |; Assign A[j] with A[j + 1]
      ---------------
      |t16 := j + 1  | ; t16 is literally same as t10
      |t17 := t16 - 1| ; t17 is literally same as t11
      |t18 := 4 * t17| ; t18 is literally same as t12
      |A[t18] := temp| ; Assign A[j + 1] with temp / old value of A[j]
      ---------------
  B8: j := j + 1
      goto B4 ; Start new iteration of the inner loop
  B5: i := i - 1
      goto B2 ; Start new iteration of the outer loop
  out:

Optimized Code

  B1: i := n - 1
  B2: if i < 1 goto out ; boundary condition for outer loop
  B3:    j := 1
  B4: if j > i goto B5 ; boundary condition for inner loop
  B6: t1 := j - 1
      t2 := 4 * t1
      t3 := A[t2] ; Address for A[j]
      -------------
      |t6 := 4 * j|
      -------------
      t7 := A[t6] ; Address for A[j + 1]
      if t3 <= t7 goto B8 ; Exit condition for comparision
  B7: t8 := j - 1
      t9 := 4 * t8
      temp := A[t9] ; Address temp with A[j]
      ---------------
      |t12 := 4 * j |
      ---------------
      t13 := A[t12] ; Address for A[j + 1]
      --------------
      |A[t9] := t13 |; Assign A[j] with A[j + 1]
      ---------------
      |A[t12] := temp| ; Assign A[j + 1] with temp / old value of A[j]
      ---------------
  B8: j := j + 1
      goto B4 ; Start new iteration of the inner loop
  B5: i := i - 1
      goto B2 ; Start new iteration of the outer loop
  out:

(Intraprocedural) Global Optimizations

Global versions of local optimizations
- global common subexpression elimination
- global constant propagation
- dead code elimination
Loop optimizations
- reduce code to be executed in each iteration
- code motion
- induction variable elimination
Other control structures
- Code hoisting: eliminates copies of identical code on parallel paths in a flow graph to reduce code size

Example

  B1: i := n - 1
  B2: if i < 1 goto out ; boundary condition for outer loop
  B3:    j := 1
  B4: if j > i goto B5 ; boundary condition for inner loop
  B6: t1 := j - 1
      t2 := 4 * t1
      t3 := A[t2] ; Address for A[j]
      t6 := 4 * j
      t7 := A[t6] ; Address for A[j + 1]
      if t3 <= t7 goto B8 ; Exit condition for comparision
      ----------------
  B7: |t8 := j - 1   |  ; t8 is same as t1 in B2
      |t9 := 4 * t8  |  ; t9 is same as t2 in B2
      |temp := A[t9] |  ; temp is same as t2 in B2, Address temp with A[j]
      |t12 := 4 * j  |  ; t12 is same as t6 in B2
      |t13 := A[t12] |  ; t13 is same as t7 in B2, Address for A[j + 1]
      |A[t9] := t13  |  ; Assign A[j] with A[j + 1]
      |A[t12] := temp|  ; Assign A[j + 1] with temp / old value of A[j]
      ---------------
  B8: j := j + 1
      goto B4 ; Start new iteration of the inner loop
  B5: i := i - 1
      goto B2 ; Start new iteration of the outer loop
  out:

Optimized Code (After Global Common Subexpression Elimination)

  B1: i := n - 1
  B2: if i < 1 goto out ; boundary condition for outer loop
  B3: j := 1
  B4: if j > i goto B5 ; boundary condition for inner loop
  B6: t1 := j - 1
      t2 := 4 * t1
      t3 := A[t2] ; Address for A[j]
      t6 := 4 * j
      t7 := A[t6] ; Address for A[j + 1]
      if t3 <= t7 goto B8 ; Exit condition for comparision
      --------------
  B7: |A[t2] := t7  | ; Assign A[j] with A[j + 1]
      |A[t6] := t2  | ; Assign A[j + 1] with temp / old value of A[j]
      ---------------
  B8: j := j + 1
      goto B4 ; Start new iteration of the inner loop
  B5: i := i - 1
      goto B2 ; Start new iteration of the outer loop
  out:

Induction Variable Elimination

Intuitively
- Loop indices are induction variables
  - (Counting iterations)
- Linear functions of the loop indices are also induction variables
  - (For accessing arrays)
Analysis: Detection of induction variable
Optimizations:
- Strength reduction:
  - Replace multiplication by additions
- Elimination of loop index:
  - Replace termination by test on other induction variables

Example:

  B1: i := n - 1
  B2: if i < 1 goto out ; boundary condition for outer loop
      -----------------   
  B3: |j := 1          | ; Loop variable for accessing array
  B4: |if j > i goto B5| ; boundary condition for inner loop
  B6: |t1 := j - 1     |
      |t2 := 4 * t1    |
      -----------------
      t3 := A[t2] ; Address for A[j]
      ------------
      |t6 := 4 * j| ; Strength reduction
      ------------
      t7 := A[t6] ; Address for A[j + 1]
      if t3 <= t7 goto B8 ; Exit condition for comparision
  B7: A[t2] := t7   ; Assign A[j] with A[j + 1]
      A[t6] := t2   ; Assign A[j + 1] with temp / old value of A[j]
      -----------
  B8: |j := j + 1|
      -----------
      goto B4 ; Start new iteration of the inner loop
  B5: i := i - 1
      goto B2 ; Start new iteration of the outer loop
  out:

Code after optimization

  B1: i := n - 1
  B2: if i < 1 goto out ; boundary condition for outer loop
      --------------------   
  B3: |t2 := 0            | ; replace t2 with 0 and use addition for each iteration
      |t6 := 4            | ; replace t6 with 4 and use addition for each iteration
  B4: |t19 := 4 * i       | ; Since we are placing an addition of 4 for j each time, to compare with i we need to multiply i with 4
      |if t6 > t19 goto B5| ; boundary condition for inner loop
      --------------------
  B6: t3 := A[t2] ; Address for A[j]
      t7 := A[t6] ; Address for A[j + 1]
      if t3 <= t7 goto B8 ; Exit condition for comparision
  B7: A[t2] := t7   ; Assign A[j] with A[j + 1]
      A[t6] := t2   ; Assign A[j + 1] with temp / old value of A[j]
      -------------
  B8: |t2 := t2 + 4|
      |t6 := t6 + 4|
      --------------
      goto B4 ; Start new iteration of the inner loop
  B5: i := i - 1
      goto B2 ; Start new iteration of the outer loop
  out:

Loop Invariant Code Motion

Analysis:
- A computation is done within a loop and
- result of the computation is the same as long as we keep going around the loop
Transformation:
- move the computation outside the loop

Machine Dependent Optimizations

Register allocation
Instruction scheduling
Memory hierarchy optimizations
etc.

Local Optimizations (More Details)

Common subexpression elimination
- array expressions
- field access in records
- access to parameters

Grpah Abstractions

Example 1

grammar (for bottom-up parsing):
- E -> E + T | E - T | T, T -> T * F | F, F -> (E) | id

expression: a + a * (b - c) + (b - c) * d

  # Parse Tree                                          # DAG
          +                                              +
       /       \                                       /   \ 
      +         *                                     +      *
    /   \      /   \                                 /\      / \
  a      *    -     d                               a--*    /   d
        / \  / \                                        \  /
       a  -  b c                                         -
         / \                                            / \ 
         b  c                                          b   c

Optimized code:

  t1 = b - c
  t2 = a * t1
  t3 = a + t2
  t4 = t1 * d
  t5 = t3 + t4

Example 2 (DAG is not correct all the time)

Code:

  a = b + c
  b = a - d
  c = b + c
  d = a - d

The resulting DAG

Optimized Code

      a = b + c;
      d = a - d;
      c = d + c;

In fact, this code is incorrect

Critiques of DAGs
- Cause of problems
  - Assignment statements
  - Value of variable depends on TIME
- How to fix problem?
  - build graph in order of execution
  - attach variable name to lastest value
- Final graph created is not very interesting
  - Key: variable -> value mapping across time
  - loses appeal of abstraction

Value Numbering(VN)

More explicit with respect to VALUES, and TIME
Each value has its own “number”
- Common subexpression means same value number
var2value: current map of variable to value
- Used to determine the value number of current expression
- r1 + r2 $\rightarrow$ var2value(r1) + var2value(r2)

Algorithm

Data Structure: VALUES = 
    Table of expression //[OP, valnum1, valnum2]
             var // name of variable currently holding expression
                                  
For each instruction (dst = src1 OP src2) in execution order

    valnum1 = var2value(src1);
    varnum2 = var2value(src2);
    
    IF[OP, valnum1, valnum2] is in VALUES
        v = the index of expression
        Replace instruction with CPY dst = VALUES[v].var
    ELSE
        Add
            expression = [OP, valnum1, valnum2]
            var = dst
        to VALUES
        v = index of new entry; tv is new temporary for v
        Replace instruction with: 
            tv = VALUES[valnum1].var OP VALUES[valnum2].var
        CPY dst = tv
    set_var2value(dst, v)

More Details
What are the initial values of the variables?
- Values at beginning of the basic block
Possible implementations:
- Initialization: Create “initial values” for all variables
- Or dynamically create them as they are used
Implementation of VALUES and var2value: hash tables

Example Assign: a $\rightarrow r1$ b $\rightarrow r2$ c $\rightarrow r3$ d $\rightarrow r4$


a = b + c	ADD t1 = r2, r3 CPY r1 = t1
b = a - d	SUB t2 = r1, r4 CPY r2 = t2
c = b + c	ADD t3 = r2, r3 CPY r3 = t3 Since b is reassigned in the previous it is not the same as the first one thus it requires a new insert
d = a - d	SUB t4 = r1, r4 CPY r4 = t2

Compiler Optimization - Introduction

CSCD70 Week1

Purpose of compilers

How can the compiler improve performance

Optimization Formalization

Representation: Instructions

Optimization Example

Translated code

Representation: Basic Block

Flow Graphs

Sources of Optimizations

Local Optimizations

(Intraprocedural) Global Optimizations

Induction Variable Elimination

Loop Invariant Code Motion

Machine Dependent Optimizations

Local Optimizations (More Details)

Grpah Abstractions

Value Numbering(VN)

CATALOG

FEATURED TAGS