CSC362 2004S.L02 (Class 41): Code Improvement

Admin:
* Online grading has arrived.  YA reason to hack PioneerWeb.
* Custom course evaluation forms now available.  Please bring a *filled out* form to class tomorrow.
* Class Friday meets in the normal place.  Ananta and Sam will pick up the food.  Get your orders to Ananta asap.
* You have now reached the top of my grading heap.
  * It's Pub Grading Night!
* Final questions on the translator?
  * Yes, it is due tomorrow at 1:15 p.m.  
  * Yes, starting yesterday was a bad idea.

Overview:
* Improvement basics
  * Goals
  * Places to improve
  * Side note: Loops
[* Detour: Flow Graphs and Dataflow]
* From intermediate to actual code (code generation)
* Favorite peephole optimizations

Improvement is as old as compilation
* Fortran: 
  * In the longagos: 1950's at IBM
  * First goal: Transportability
  * Second goal: Ease of programming
  * Third goal: Produce code "(almost) as good as the average programmer"
* Today:
  * Typical commercial compilers produce code better than almost any assembly language programmer (architectures are more complicated, lots of code improvement techniques developed)
* Unfortunate name for the process of improving code to make it more efficient: Optimization

* Most improvement focuses on running time
  * Code footprint (length of the code)
  * Memory use
  * Power usage (if you have a laptop; kinds of operations)
* Observation: Lots of subtle issue

type
  matrix = array[1..N,1..N] of integer;
var
  a, b: matrix;

begin
  ...
  { copy b to a; the details of b := a; }
  for i := 1 to N do
    for j := 1 to N do
      b[i,j] := a[i,j]
 
vs. 

  for i := 1 to N do
    for j := 1 to N do
      b[j,i] := a[j,i]

Claim: The way the arrays are arranged in memory affects which of these runs faster.

Observation: Two basic ways to arrange two-dimensional arrays
Row-major: A[1,1],A[1,2],A[1,3]...A[1,N],A[2,1],A[2,2]....,A[2,N],...
Column-major: A[1,1],A[2,1],A[3,1]...,A[N,1],A[2,2],...

If your arrays are big enough (say, you do Physics programming or work for
Google), the second for loop will do lots of paging.

To solve this problem:
* Programmer: Arrange the order of the array to match the loop
* Programmer: Arrange the loop to match the order of the array
* Compiler: Detrmine the access patterns and arrange the array to match the access pattern

Observations:
* Need to think about more than just the assembly code produced
* Improvement can be done at different compilation levels

The programmer can:
* Choose the right algorithm for the problem!
* [Analyze and determine the "hot spots" and work on those.]
* Use integers rather than floats or doubles.
* Turn tail-recursive functions into loops (unless your compiler does)
* Lots of special-purpose stuff, like loop unrolling

Loop unrolling:
* Given a loop, reproduce the loop by writing the body multiple times
for i := 1 to 5 do
  writeln(i);

writeln(1);
writeln(2);
writeln(3);
writeln(4);
writeln(5);

  MOV $1 i
LOOP:
  ARG i
  CALL writeln
  JEQ i $5 ENDLOOP
  IADD $1 i -> i
  JUMP LOOP
ENDLOOP:

Unroll
  ARG $1
  CALL writeln
  ARG $2
  CALL writeln
  ARG $3
  CALL writeln
  ARG $4
  CALL writeln
  ARG $5
  CALL writeln
ENDLOOP:

Direct improvement: Fewer instructions
Indirect improvement: Faster on pipelined architectures (no need to flush the pipe)

The translator (from tree to intermediate code)
* Should not do anything
* Better to produce clear code you can prove correct than unclear fast code

Intermediate code
* Can be "peephole improved"
* Can have some global analyses applied (like loop unrolling and function inlining)

Intermediate to Assembly
* Choose the best translation (which is often architecture-specific)

At assembly level
* Move peephole optimization

Typical peephole optimizations:

* Copy propagation: 
  MOVE x y
  ...
  ADD x x -> z
  * Advantages: If we are lucky, we eliminate *all* uses of y, Eliminate the MOVE, save temporaries, increase chance of successful register allocation
* "Shared code elimination"
  ADD x y -> q
  ...
  ADD x y -> r
  * Replace by
  ADD x y -> q
  ...
  MOVE q r
  * A move is usally faster than an add
  * Permits additional copy propagation
 
Final optimization: Simplify math 
  IMULT $2 a -> b
can be
  LSHIFT a $1 -> b
or
  IADD a a -> b