Computer Organization

Lab: Optimizing CPU Performance for Pipelines

Purpose
Learn to avoid branch and data hazards by reordering instructions in a program. Learn to reorganize code to optimize performance on a single pipeline and on a dual pipeline.
Method
Modify assembly programs to avoid branch and data hazards. Modify programs to get optimal performance for a single pipeline and for a dual pipeline.
Preparation
Read chapter 4 in the text.
Files to Use
 
ArraySum.s
What to Hand In
Zip up your entire lab directory, including:
ArraySumP1.s -- no reorder, nops added for all data and branch hazards (assume accelerated branch and data forwarding)
ArraySumP2.s -- reordered to optimize assuming accelerated branch with branch delay and data forwarding
ArraySumP3.s -- use loop unrolling to optimize assuming accelerated branch with branch delay and data forwarding
 
ArraySumD1.s -- reordered (no unrolling) to optimize assuming accelerated branch with branch delay and data forwarding, using dual pipeline
ArraySumD2.s -- use loop unrolling to optimize assuming accelerated branch with branch delay and data forwarding, using dual pipeline

A text file summarizing the times for the different programs with the base loop running 1000 iterations, including the time for the single-cycle CPU.
 

Testing P1, P2, P3 versions

You can test using MARS set to use branch delay. However, data hazards are not detected by MARS, so a successful test only shows that the logic is still correct after your reordering. (Note that the initial version also works with branch delay set, but an extra instruction is executed each time around the loop.)
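Because MARS does not flag data hazards, a small script can serve as a rough second check. The following is a minimal sketch, not part of the lab materials. It assumes the standard five-stage model with full forwarding and branches resolved in the decode stage, so a load needs one instruction before its result is used, an ALU result needs one instruction before a branch tests it, and a load needs two instructions before a branch tests it. The names `parse`, `needed_gap`, and `find_hazards` are hypothetical. Only a small subset of instructions is recognized, and the loop back edge is not followed.

```python
import re

LOADS = {"lw", "lh", "lhu", "lb", "lbu"}
STORES = {"sw", "sh", "sb"}
BRANCHES = {"beq", "bne"}


def parse(line):
    """Return (mnemonic, dest, sources) for one line, or None if empty."""
    text = line.split("#")[0].split(":")[-1]  # drop comment and label
    words = text.split()
    if not words:
        return None
    op = words[0]
    regs = re.findall(r"\$\w+", text)
    if op in STORES or op in BRANCHES:
        return op, None, regs  # these only read registers
    # Assumption: for every other instruction the first register is written.
    return op, (regs[0] if regs else None), regs[1:]


def needed_gap(producer_op, consumer_op):
    """Instructions required between a producer and its consumer."""
    is_load = producer_op in LOADS
    if consumer_op in BRANCHES:
        return 2 if is_load else 1
    # Conservative: a store of a just-loaded value is treated as a use.
    return 1 if is_load else 0


def find_hazards(lines):
    """Return (producer_index, consumer_index, register) tuples."""
    instrs = [p for p in map(parse, lines) if p]
    hazards = []
    for i, (op, dest, _) in enumerate(instrs):
        if dest is None or dest in ("$zero", "$0"):
            continue
        for distance in (1, 2):
            if i + distance >= len(instrs):
                break
            c_op, c_dest, c_srcs = instrs[i + distance]
            if dest in c_srcs and distance - 1 < needed_gap(op, c_op):
                hazards.append((i, i + distance, dest))
            if c_dest == dest:
                break  # register overwritten, later reads see the new value
    return hazards


# Hypothetical fragment, not taken from ArraySum.s.
print(find_hazards(["lw $t0, 0($t1)", "add $t2, $t0, $t3"]))
```

The indices count instruction lines only, so label-only and blank lines are skipped. An empty result means no hazard was found under these assumptions; it does not prove the code is hazard free.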

Testing D1, D2 versions

By alternating instructions from your two pipelines, you can test using MARS that your logic is correct, except that again, data hazards are not detected. Also note that if the slot after a branch or jump has two instructions in it (one in each pipeline), then the logic may fail in MARS.
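The alternation described above can be automated. The sketch below is one possible helper, not something supplied with the lab. It assumes each issue packet is written as a pair (ALU-pipeline instruction, load/store-pipeline instruction), with `nop` filling an empty slot. The names `mnemonic` and `interleave` are hypothetical, and emitting the ALU instruction first within a packet is an arbitrary choice. It also warns when the packet after a branch or jump holds two real instructions, which is the case that may fail in MARS.

```python
def mnemonic(instr):
    """Return the opcode of an instruction, ignoring labels and comments."""
    words = instr.split("#")[0].split(":")[-1].split()
    return words[0] if words else ""


def is_control_transfer(instr):
    """Heuristic: MIPS branch and jump mnemonics start with 'b' or 'j'."""
    return mnemonic(instr)[:1] in ("b", "j")


def interleave(packets):
    """Flatten (alu, mem) packets into one stream for MARS.

    Returns (lines, warnings). A warning is recorded when the packet in a
    branch delay slot contains two non-nop instructions.
    """
    lines, warnings = [], []
    prev_was_branch = False
    for index, (alu, mem) in enumerate(packets):
        both_real = mnemonic(alu) != "nop" and mnemonic(mem) != "nop"
        if prev_was_branch and both_real:
            warnings.append(f"packet {index}: two instructions in the delay slot")
        lines.extend([alu, mem])
        # Branches and jumps run in the non-memory pipeline.
        prev_was_branch = is_control_transfer(alu)
    return lines, warnings


# Hypothetical packets, not taken from ArraySum.s.
packets = [
    ("addi $t0, $t0, 4", "lw $t1, 0($t0)"),
    ("nop", "nop"),
    ("bne $t0, $t2, loop", "nop"),
    ("add $t3, $t3, $t1", "sw $t1, 0($t4)"),
]
lines, warnings = interleave(packets)
print("\n".join(lines))
print(warnings)
```

The flattened stream keeps the nops, so it stays line-for-line comparable with the two-column version you hand in.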

Steps

  1. Calculate the time for ArraySum.s to execute on a single-cycle CPU, assuming 1000 iterations instead of 8. (The cycle time is 800 ps.)
  2. Insert the needed nops into ArraySum.s assuming delayed branch and full data forwarding for the five-stage pipeline. Calculate the time, assuming 1000 iterations (cycle time 200 ps). (Save as ArraySumP1.s)
  3. Modify the ArraySum.s code to optimize for the accelerated branch with branch delay and data forwarding. (Save as ArraySumP2.s) Calculate the time, assuming 1000 iterations.
  4. Unroll the loop four times. Use the same assumptions as for step 3. Rearrange the code to minimize the need for nops, assuming data forwarding. Calculate the time, assuming 1000 iterations. (Save as ArraySumP3.s)
  5. Assume that you have a dual pipeline architecture. One pipeline only does loads and stores; the other does all other instructions. Assume accelerated branch with branch delay and full data forwarding (even across pipelines). Give the best organization of the code with no loop unrolling. Calculate the time, assuming 1000 iterations. (Save as ArraySumD1.s)
  6. Use loop unrolling and register renaming to optimize the code for the dual pipeline described in step 5. Calculate the time, assuming 1000 iterations. (Save as ArraySumD2.s)
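
The time calculations asked for in the steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the required method. The instruction counts are hypothetical placeholders because they depend on your own version of the code. The pipelined formula counts the four extra cycles needed to fill a five-stage pipeline, which some courses ignore. Using the 200 ps cycle time for the dual pipeline is also an assumption, since the steps give no separate figure.

```python
SINGLE_CYCLE_PS = 800    # cycle time given in step 1
PIPELINE_CYCLE_PS = 200  # cycle time given in step 2
PIPELINE_STAGES = 5      # five-stage pipeline from step 2


def dynamic_count(setup, body, trips):
    """Slots executed: setup once, then the loop body once per trip."""
    return setup + body * trips


def single_cycle_time_ps(instructions):
    """Every instruction takes one long cycle."""
    return instructions * SINGLE_CYCLE_PS


def pipelined_time_ps(slots):
    """One slot finishes per cycle once the pipeline is full.

    `slots` counts nops too. For a dual pipeline, pass the number of
    issue packets (pairs) instead. The extra PIPELINE_STAGES - 1 cycles
    are the time to fill the pipeline.
    """
    return (slots + PIPELINE_STAGES - 1) * PIPELINE_CYCLE_PS


# Hypothetical counts, NOT taken from ArraySum.s: 3 setup instructions and
# a 6-instruction loop body that grows to 8 slots once nops are inserted.
base = dynamic_count(setup=3, body=6, trips=1000)
with_nops = dynamic_count(setup=3, body=8, trips=1000)
# Unrolling four times turns 1000 base iterations into 250 trips.
unrolled = dynamic_count(setup=3, body=18, trips=1000 // 4)

print("single cycle:", single_cycle_time_ps(base), "ps")
print("pipelined with nops:", pipelined_time_ps(with_nops), "ps")
print("pipelined, unrolled:", pipelined_time_ps(unrolled), "ps")
```

Replace the placeholder counts with the numbers from your own files. Remember that the delay-slot instruction and every nop occupy a slot.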