COSC 201: Computer Organization
Lab 10:
CPU Performance
- Purpose
-
Compare the performance of different MIPS CPU implementations.
- Method
-
Calculate the execution time for different CPU implementations, using
optimized code.
- Preparation
-
Read chapters 5 and 6 in the textbook.
- Files to Use
- main.s, array.s
- What to Hand In
-
-
Code for each case in files with the following names: array.3.1.s,
array.3.2.s, array.3.3.s, array.4.1.s, and array.4.2.s. In cases 4.1
and 4.2 be sure to indicate pairs of instructions that are launched in
the same clock cycle.
-
Written analysis in a file named array.txt (or array.html or
array.doc) that gives the time to execute the code for 1000
iterations of the loop for each case.
Your report should include a summary for each case in a table similar
to the one below.
The code given below does an element by element addition of two
arrays, storing the result in a third array. The array elements are
accessed by pointer hopping, as might be done by optimized code. You
are to analyze the code, calculating the actual execution time for
each of the following cases:
-
Single cycle CPU, cycle time 8 ns
-
Multicycle CPU, cycle time 2 ns
-
Pipelined CPU, cycle time 2 ns
-
Unoptimized code: load and branch delay slots are filled with nops.
-
Optimized code: load and branch delay slots are filled with useful
instructions, if possible, by reordering instructions.
-
Unrolled code: the loop is unrolled four times (assume the array size
is a multiple of four), and load and branch delay slots are filled
with useful instructions, if possible.
-
Superscalar (2-way) pipelined, cycle time 2 ns
-
Optimized code: load and branch delay slots are filled with useful
instructions, if possible, by reordering instructions. Instruction
pairs are launched together when possible.
-
Unrolled code: the loop is unrolled four times (assume the array size
is a multiple of four), and load and branch delay slots are filled
with useful instructions, if possible.
We do not worry about cache issues for the analysis and assume that
the arrays hold 1000 elements.
For the pipelined CPU we assume that the instruction immediately after
a load must not use the register being loaded into. If necessary,
the compiler inserts a nop instruction in the simple case (3.1) and for case
(3.2) optimizes the code by moving instructions to fill load and branch
delay slots. For case (3.3), we do four iterations each time around
the loop, and see if we can save any additional time from the reduced number
of branches and by not computing the addresses every time for a load.
This assumes the number of elements in the array is a multiple of four.
For the superscalar CPU, any arithmetic or branch instruction can be
paired with any memory access instruction (lw or sw) and launched in
the same clock cycle. The two instructions that are launched at the
same time may not depend on each other. In addition, the instructions
launched in the clock cycle immediately after a load may not use the
register that the load writes to and the instructions launched in the
clock cycle immediately after a branch are always executed (branch
delay slot).
For the superscalar loop-unrolling case, we do four iterations of the
loop in sequence, then loop back. Combining this with reordering
of instructions, plus eliminating three branch instructions, can make
the superscalar case even faster.
Hints:
-
Single cycle CPU: this case is easy since every instruction takes one
clock cycle.
-
Multicycle CPU: this case is also easy, each instruction takes the
number of clock cycles specified in the textbook.
-
Pipelined CPU
-
Unoptimized code: just put nops in where needed and then one
instruction completes every clock cycle.
-
Optimized code: replace nops by other instructions without changing
the logic.
-
Unrolled code: see pages 513-514 in the textbook for loop unrolling.
You may also need to use one or more additional registers to avoid
dependencies. Remember that the loop only executes one fourth as many
times now.
-
Superscalar CPU
-
Optimized code: pair each lw and sw instruction with one that comes
before or after, so that dependency restrictions are not violated.
You may still need one nop.
-
Unrolled code: pair each lw and sw instruction with one that comes
before or after, but move instructions to optimize. You may also need
to use one or more additional registers to avoid dependencies.
Remember that the loop only executes one fourth as many times now.
C++ code
void array(int len, int A[], int B[], int C[])
{
int k = len;
do
{
k--;
C[k] = A[k] + B[k];
} while (k > 0);
}
| Instructions | Cycles | Time (ns) |
| 1 Single Cycle |
| | |
| 2 Multicycle |
| | |
| 3.1 Pipelined (Unoptimized) |
| | |
| 3.2 Pipelined (Optimized) |
| | |
| 3.3 Pipelined (Unrolled) |
| | |
| 4.1 Superscalar (Optimized) |
| | |
| 4.2 Superscalar (Unrolled) |
| | |