COSC 201: Computer Organization

Lab 10: CPU Performance

Purpose
Compare the performance of different MIPS CPU implementations.
Method
Calculate the execution time for different CPU implementations, using optimized code.
Preparation
Read chapters 5 and 6 in the textbook.
Files to Use
main.s, array.s
What to Hand In
  1. Code for each case in files with the following names: array.3.1.s, array.3.2.s, array.3.3.s, array.4.1.s, and array.4.2.s. In cases 4.1 and 4.2 be sure to indicate pairs of instructions that are launched in the same clock cycle.
  2. Written analysis in a file named array.txt (or array.html or array.doc) that gives the time to execute the code for 1000 iterations of the loop for each case. Your report should include a summary for each case in a table similar to the one below.

The code given below does an element by element addition of two arrays, storing the result in a third array. The array elements are accessed by pointer hopping, as might be done by optimized code. You are to analyze the code, calculating the actual execution time for each of the following cases:

  1. Single cycle CPU, cycle time 8 ns
  2. Multicycle CPU, cycle time 2 ns
  3. Pipelined CPU, cycle time 2 ns
    1. Unoptimized code: load and branch delay slots are filled with nops.
    2. Optimized code: load and branch delay slots are filled with useful instructions, if possible, by reordering instructions.
    3. Unrolled code: the loop is unrolled four times (assume the array size is a multiple of four), and load and branch delay slots are filled with useful instructions, if possible.
  4. Superscalar (2-way) pipelined, cycle time 2 ns
    1. Optimized code: load and branch delay slots are filled with useful instructions, if possible, by reordering instructions. Instruction pairs are launched together when possible.
    2. Unrolled code: the loop is unrolled four times (assume the array size is a multiple of four), and load and branch delay slots are filled with useful instructions, if possible.

We do not worry about cache issues for the analysis and assume that the arrays hold 1000 elements.

For the pipelined CPU we assume that the instruction immediately after a load must not use the register being loaded into. If necessary, the compiler inserts a nop instruction in the simple case (3.1) and for case (3.2) optimizes the code by moving instructions to fill load and branch delay slots. For case (3.3), we do four iterations each time around the loop, and see if we can save any additional time from the reduced number of branches and by not computing the addresses every time for a load. This assumes the number of elements in the array is a multiple of four.

For the superscalar CPU, any arithmetic or branch instruction can be paired with any memory access instruction (lw or sw) and launched in the same clock cycle. The two instructions that are launched at the same time may not depend on each other. In addition, the instructions launched in the clock cycle immediately after a load may not use the register that the load writes to and the instructions launched in the clock cycle immediately after a branch are always executed (branch delay slot).

For the superscalar loop-unrolling case, we do four iterations of the loop in sequence, then loop back. Combining this with reordering of instructions, plus eliminating three branch instructions, can make the superscalar case even faster.

Hints:

  1. Single cycle CPU: this case is easy since every instruction takes one clock cycle.
  2. Multicycle CPU: this case is also easy, each instruction takes the number of clock cycles specified in the textbook.
  3. Pipelined CPU
    1. Unoptimized code: just put nops in where needed and then one instruction completes every clock cycle.
    2. Optimized code: replace nops by other instructions without changing the logic.
    3. Unrolled code: see pages 513-514 in the textbook for loop unrolling. You may also need to use one or more additional registers to avoid dependencies. Remember that the loop only executes one fourth as many times now.
  4. Superscalar CPU
    1. Optimized code: pair each lw and sw instruction with one that comes before or after, so that dependency restrictions are not violated. You may still need one nop.
    2. Unrolled code: pair each lw and sw instruction with one that comes before or after, but move instructions to optimize. You may also need to use one or more additional registers to avoid dependencies. Remember that the loop only executes one fourth as many times now.

C++ code

void array(int len, int A[], int B[], int C[])
{
    int k = len;
    do
    {
        k--;
        C[k] = A[k] + B[k];
    } while (k > 0);
}

Instructions Cycles Time (ns)
1 Single Cycle
2 Multicycle
3.1 Pipelined (Unoptimized)
3.2 Pipelined (Optimized)
3.3 Pipelined (Unrolled)
4.1 Superscalar (Optimized)
4.2 Superscalar (Unrolled)