Lab 8 Solution

Problem 1: Comparing Direct Mapped Caches

Cache    Loop Length      Misses        Hits       Cycles    CPI
Cache 1  k = 32               34  32,000,000   32,000,170  1.000
         k = 33        2,000,035  31,000,000   41,000,175  1.242
         k = 64       64,000,066           0  320,000,330  5.000
Cache 2  k = 32        2,000,010  30,000,024   44,000,094  1.375
         k = 33        2,000,010  31,000,025   45,000,095  1.364
         k = 64       18,000,018  46,000,048  172,000,174  2.687

  1. The first instruction of the loop, <instruction a>, has a memory address of 1001 1000 1111 0000 0000 1100 0110 0100, so it will be loaded into Cache 1 at index 11001, and the upper 25 bits of the address will be stored as the tag. The instructions that follow will be loaded into successive cache lines until the end of the cache is reached.

    <instruction h>, which has a memory address of 1001 1000 1111 0000 0000 1100 1000 0000, wraps around to the first line of the cache at index 00000, and the upper 25 bits of the address will be stored as the tag. The remaining instructions will be loaded into successive cache lines until the end of the loop.

    The table below shows the contents of the cache after one iteration of the loop.

    index tag                             word
    00000 1001 1000 1111 0000 0000 1100 1 <instruction h>
    00001 1001 1000 1111 0000 0000 1100 1 <instruction i>
    00010 1001 1000 1111 0000 0000 1100 1 <instruction j>
    00011 1001 1000 1111 0000 0000 1100 1 <instruction k>
    00100 1001 1000 1111 0000 0000 1100 1 <instruction l>
    00101 1001 1000 1111 0000 0000 1100 1 <instruction m>
    00110 1001 1000 1111 0000 0000 1100 1 <instruction n>
    00111 1001 1000 1111 0000 0000 1100 1 <instruction o>
    01000 1001 1000 1111 0000 0000 1100 1 <instruction p>
    01001 1001 1000 1111 0000 0000 1100 1 <instruction q>
    01010 1001 1000 1111 0000 0000 1100 1 <instruction r>
    01011 1001 1000 1111 0000 0000 1100 1 <instruction s>
    01100 1001 1000 1111 0000 0000 1100 1 <instruction t>
    01101 1001 1000 1111 0000 0000 1100 1 <instruction u>
    01110 1001 1000 1111 0000 0000 1100 1 <instruction v>
    01111 1001 1000 1111 0000 0000 1100 1 <instruction w>
    10000 1001 1000 1111 0000 0000 1100 1 <instruction x>
    10001 1001 1000 1111 0000 0000 1100 1 <instruction y>
    10010 1001 1000 1111 0000 0000 1100 1 <instruction z>
    10011 1001 1000 1111 0000 0000 1100 1 <instruction aa>
    10100 1001 1000 1111 0000 0000 1100 1 <instruction bb>
    10101 1001 1000 1111 0000 0000 1100 1 <instruction cc>
    10110 1001 1000 1111 0000 0000 1100 1 <instruction dd>
    10111 1001 1000 1111 0000 0000 1100 1 addi $s5, $s5, 1
    11000 1001 1000 1111 0000 0000 1100 1 bne $s5, $s6, loop
    11001 1001 1000 1111 0000 0000 1100 0 <instruction a>
    11010 1001 1000 1111 0000 0000 1100 0 <instruction b>
    11011 1001 1000 1111 0000 0000 1100 0 <instruction c>
    11100 1001 1000 1111 0000 0000 1100 0 <instruction d>
    11101 1001 1000 1111 0000 0000 1100 0 <instruction e>
    11110 1001 1000 1111 0000 0000 1100 0 <instruction f>
    11111 1001 1000 1111 0000 0000 1100 0 <instruction g>
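
    The index and tag values in the table above can be reproduced with a few shifts and remainders. The sketch below is illustrative only: the function name split_address is made up for this example, and the Cache 1 geometry (32 lines of one word each) is an assumption inferred from the 5-bit index and 25-bit tag.

```python
def split_address(addr, num_lines, words_per_line=1):
    """Split a 32-bit byte address into (tag, index, word_offset, byte_offset)."""
    byte_offset = addr & 0b11                 # 4-byte words: low 2 bits
    word_addr = addr >> 2                     # address in units of words
    word_offset = word_addr % words_per_line  # position inside the cache line
    block_addr = word_addr // words_per_line  # address in units of cache lines
    index = block_addr % num_lines            # which cache line the block maps to
    tag = block_addr // num_lines             # remaining upper bits
    return tag, index, word_offset, byte_offset


ADDR_A = 0b1001_1000_1111_0000_0000_1100_0110_0100  # <instruction a>
ADDR_H = 0b1001_1000_1111_0000_0000_1100_1000_0000  # <instruction h>

# Assumed Cache 1 geometry: 32 lines, 1 word per line.
tag_a, index_a, _, _ = split_address(ADDR_A, num_lines=32)
tag_h, index_h, _, _ = split_address(ADDR_H, num_lines=32)

print(f"<instruction a>: index={index_a:05b} tag={tag_a:025b}")
print(f"<instruction h>: index={index_h:05b} tag={tag_h:025b}")
```

    The same function covers caches with multi-word lines by passing words_per_line; for example, 8 lines of 4 words gives a 3-bit index and a 2-bit word offset.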

    In the first iteration, there will be 34 compulsory misses: 2 instructions before the loop, and 32 instructions in the body of the loop. At the end of the first iteration, all the instructions of the loop are in the cache, so there will be 32 hits in each of the next 1,000,000 iterations. With 5 clock cycles per miss and 1 clock cycle per hit, the total time required will be (34 × 5) + (32,000,000 × 1) = 32,000,170 clock cycles.
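
    The counts above can be checked with a small simulation. This is a minimal sketch rather than the lab's official tooling: run_direct_mapped is a hypothetical helper, Cache 1 is assumed to have 32 one-word lines, and the 2 instructions before the loop are assumed to sit immediately before <instruction a> in memory. Only the first two iterations are simulated; every later iteration repeats the second one, so the totals are extrapolated.

```python
def run_direct_mapped(trace, num_lines, words_per_line=1, lines=None):
    """Feed byte addresses through a direct-mapped cache.

    Returns (misses, hits, lines) so a later call can continue from
    the same cache contents.
    """
    lines = {} if lines is None else lines      # index -> tag currently stored
    misses = hits = 0
    for addr in trace:
        block = (addr >> 2) // words_per_line   # drop the 2-bit byte offset
        index, tag = block % num_lines, block // num_lines
        if lines.get(index) == tag:
            hits += 1
        else:
            misses += 1
            lines[index] = tag                  # replace whatever was there
    return misses, hits, lines


LOOP_START = 0x98F00C64          # address of <instruction a>
K = 32                           # instructions in the loop body
MISS_CYCLES, HIT_CYCLES = 5, 1
REPEATS = 1_000_000              # iterations after the first

# Assumption: the 2 pre-loop instructions directly precede the loop.
prologue = [LOOP_START - 8, LOOP_START - 4]
body = [LOOP_START + 4 * i for i in range(K)]

first_misses, first_hits, state = run_direct_mapped(prologue + body, num_lines=32)
later_misses, later_hits, state = run_direct_mapped(body, num_lines=32, lines=state)

total_misses = first_misses + REPEATS * later_misses
total_hits = first_hits + REPEATS * later_hits
total_cycles = total_misses * MISS_CYCLES + total_hits * HIT_CYCLES
print(total_misses, total_hits, total_cycles)
```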

  2. When one instruction is added to the loop so that there are k = 33 instructions in the loop, then the first instruction and the last instruction will both be mapped to the same line in the cache. In the first iteration, there will be 35 compulsory misses: 2 instructions before the loop, and 33 instructions in the body of the loop. At the end of the first iteration, the last instruction of the loop will replace the first instruction in the cache. In each of the next 1,000,000 iterations, there will be 2 misses (for the first and last instructions) and 31 hits. With 5 clock cycles per miss and 1 clock cycle per hit, the total time required will be (2,000,035 × 5) + (31,000,000 × 1) = 41,000,175 clock cycles.

    Each additional instruction in the body of the loop will induce 2 additional misses. In the first iteration, there will be k + 2 compulsory misses. In each of the next 1,000,000 iterations, there will be (2 × k) - 64 misses and 64 - k hits. When the loop reaches k = 64 instructions, there will be k misses and no hits in each iteration. In the case of k = 64, there will be 2 + (1,000,001 × 64) misses and no hits. With 5 clock cycles per miss, the total time required will be 64,000,066 × 5 = 320,000,330 clock cycles.
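
    The per-iteration formulas above can be turned into a short calculation. The sketch below only restates the arithmetic in this answer; the function name cache1_totals is made up for illustration, and the formula is applied only for 32 <= k <= 64, the range discussed here.

```python
CACHE_LINES = 32                 # Cache 1 holds 32 one-word lines
MISS_CYCLES, HIT_CYCLES = 5, 1
REPEATS = 1_000_000              # iterations after the first


def cache1_totals(k):
    """Return (misses, hits, cycles) for a k-instruction loop, 32 <= k <= 64."""
    if not 32 <= k <= 64:
        raise ValueError("formula only covers 32 <= k <= 64")
    first_misses = k + 2                        # compulsory misses, first iteration
    misses_per_iter = 2 * k - 2 * CACHE_LINES   # (2 x k) - 64
    hits_per_iter = 2 * CACHE_LINES - k         # 64 - k
    misses = first_misses + REPEATS * misses_per_iter
    hits = REPEATS * hits_per_iter
    cycles = misses * MISS_CYCLES + hits * HIT_CYCLES
    return misses, hits, cycles


for k in (32, 33, 64):
    print(k, cache1_totals(k))
```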

  3. In the first iteration of the loop, misses occur for the two instructions before the loop. Because an entire line of the cache is replaced on a miss, <instruction a>, which has a memory address of 1001 1000 1111 0000 0000 1100 0110 0100, will already be present in the cache at index 110 with a word offset of 01. The next two instructions will also be hits. The next miss occurs with <instruction d>. The pattern of 1 miss followed by 3 hits continues until the end of the loop. When the last instruction is loaded into the cache at index 110 with offset 00, <instruction a>, <instruction b>, and <instruction c> are replaced with 3 instructions from beyond the end of the loop.

    The table below shows the contents of the cache after one iteration of the loop.

    index  tag                              word 00             word 01             word 10             word 11
    000    1001 1000 1111 0000 0000 1100 1  <instruction h>     <instruction i>     <instruction j>     <instruction k>
    001    1001 1000 1111 0000 0000 1100 1  <instruction l>     <instruction m>     <instruction n>     <instruction o>
    010    1001 1000 1111 0000 0000 1100 1  <instruction p>     <instruction q>     <instruction r>     <instruction s>
    011    1001 1000 1111 0000 0000 1100 1  <instruction t>     <instruction u>     <instruction v>     <instruction w>
    100    1001 1000 1111 0000 0000 1100 1  <instruction x>     <instruction y>     <instruction z>     <instruction aa>
    101    1001 1000 1111 0000 0000 1100 1  <instruction bb>    <instruction cc>    <instruction dd>    addi $s5, $s5, 1
    110    1001 1000 1111 0000 0000 1100 1  bne $s5, $s6, loop  <instruction a>     <instruction b>     <instruction c>
    111    1001 1000 1111 0000 0000 1100 0  <instruction d>     <instruction e>     <instruction f>     <instruction g>

    In the first iteration there are 10 compulsory misses and 24 hits. In succeeding iterations, there will always be a miss for the first instruction in the loop, <instruction a>, and for the last instruction in the loop since they are both mapped into the same line in the cache. Thus there will be 2 misses and 30 hits in each of the next 1,000,000 iterations. With 7 clock cycles per miss and 1 clock cycle per hit, the total time required will be (2,000,010 × 7) + (30,000,024 × 1) = 44,000,094 clock cycles.
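
    The same kind of simulation works for Cache 2 once the line size is taken into account. This is a sketch under stated assumptions: Cache 2 is taken to have 8 lines of 4 words (inferred from the 3-bit index and 2-bit word offset), the 2 pre-loop instructions are assumed to sit immediately before <instruction a>, and run_direct_mapped and cache2_counts are hypothetical helpers.

```python
def run_direct_mapped(trace, num_lines, words_per_line, lines=None):
    """Feed byte addresses through a direct-mapped cache; return (misses, hits, lines)."""
    lines = {} if lines is None else lines      # index -> tag currently stored
    misses = hits = 0
    for addr in trace:
        block = (addr >> 2) // words_per_line   # a miss loads the whole block
        index, tag = block % num_lines, block // num_lines
        if lines.get(index) == tag:
            hits += 1
        else:
            misses += 1
            lines[index] = tag
    return misses, hits, lines


LOOP_START = 0x98F00C64              # address of <instruction a>
NUM_LINES, WORDS_PER_LINE = 8, 4     # assumed Cache 2 geometry
MISS_CYCLES, HIT_CYCLES = 7, 1
REPEATS = 1_000_000                  # iterations after the first


def cache2_counts(k):
    """Return (first_misses, first_hits, later_misses, later_hits) for a k-instruction loop."""
    prologue = [LOOP_START - 8, LOOP_START - 4]   # assumed to precede the loop directly
    body = [LOOP_START + 4 * i for i in range(k)]
    fm, fh, state = run_direct_mapped(prologue + body, NUM_LINES, WORDS_PER_LINE)
    lm, lh, _ = run_direct_mapped(body, NUM_LINES, WORDS_PER_LINE, state)
    return fm, fh, lm, lh


first_m, first_h, later_m, later_h = cache2_counts(32)
total_misses = first_m + REPEATS * later_m
total_hits = first_h + REPEATS * later_h
total_cycles = total_misses * MISS_CYCLES + total_hits * HIT_CYCLES
print(total_misses, total_hits, total_cycles)
```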

  4. When one instruction is added so that there are k = 33 instructions in the loop, then the last 2 instructions of the loop will be mapped into the same cache line as the first instruction. Because the entire line of the cache is loaded at the same time, this additional instruction will already be in the cache and will be a hit. In the first iteration, there will be 10 compulsory misses and 25 hits. In each of the next 1,000,000 iterations, there will be 2 misses and 31 hits. With 7 clock cycles per miss and 1 clock cycle per hit, the total time required will be (2,000,010 × 7) + (31,000,025 × 1) = 45,000,095 clock cycles.

    Up to 3 instructions may be added to the body of the loop without causing additional misses. For every 4 instructions added to the loop, 3 of the new instructions will be hits and there will be 2 additional misses in each iteration: one for the fourth new instruction, which starts a new cache line, and one for the loop instruction whose line it displaces. When the loop reaches k = 64 instructions, there will be 18 misses and 48 hits in the first iteration and 18 misses and 46 hits in each of the next 1,000,000 iterations. With 7 clock cycles per miss and 1 clock cycle per hit, the total time required will be (18,000,018 × 7) + (46,000,048 × 1) = 172,000,174 clock cycles.
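
    The Cycles and CPI columns of the Problem 1 table follow directly from the miss and hit counts. The sketch below recomputes them from the counts stated in this solution; it assumes that every executed instruction is exactly one cache access, so the instruction count is misses + hits. The helper name cycles_and_cpi is made up for illustration.

```python
rows = [
    # (cache, k, misses, hits, cycles per miss), counts taken from the answers above
    ("Cache 1", 32, 34, 32_000_000, 5),
    ("Cache 1", 33, 2_000_035, 31_000_000, 5),
    ("Cache 1", 64, 64_000_066, 0, 5),
    ("Cache 2", 32, 2_000_010, 30_000_024, 7),
    ("Cache 2", 33, 2_000_010, 31_000_025, 7),
    ("Cache 2", 64, 18_000_018, 46_000_048, 7),
]


def cycles_and_cpi(misses, hits, miss_cycles, hit_cycles=1):
    """Return (total cycles, cycles per instruction)."""
    cycles = misses * miss_cycles + hits * hit_cycles
    instructions = misses + hits      # assumption: one fetch per instruction
    return cycles, cycles / instructions


for cache, k, misses, hits, miss_cycles in rows:
    cycles, cpi = cycles_and_cpi(misses, hits, miss_cycles)
    print(f"{cache}  k = {k}  {cycles:>11,}  {cpi:.3f}")
```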

Problem 2: Comparing a Direct Mapped Cache to a Set Associative Cache

Cache       Misses        Hits      Cycles    CPI
Cache 2  8,000,011  23,000,023  79,000,100  2.548
Cache 3  3,000,010  28,000,024  49,000,094  1.581

  1. The first instruction of the loop, add $a0, $s0, 0, has a memory address of 1001 1000 1111 0000 0000 1100 0110 0000, so it will be loaded into Cache 2 at index 110 with a word offset of 00, and the upper 25 bits of the address will be stored as the tag.

    The table below shows how the loop is mapped into Cache 2.

    index  tag                              word 00             word 01             word 10             word 11
    000    1001 1000 1111 0000 0000 1100 1  <instruction c>     <instruction d>     <instruction e>     <instruction f>
    001    1001 1000 1111 0000 0000 1100 1  <instruction g>     <instruction h>     addi $s5, $s5, 1    bne $s5, $s6, loop
    010
    011
    100
    101
    110    1001 1000 1111 0000 0000 1100 0  add $a0, $s0, 0     add $a1, $s1, 0     add $a2, $s2, 0     add $a3, $s3, 0
    111    1001 1000 1111 0000 0000 1100 0  jal function        add $s4, $s4, $v0   <instruction a>     <instruction b>

    The first instruction of the function, addi $sp, $sp, -16, has a memory address of 1001 1000 1111 0000 0000 1111 0110 1000, so it will be loaded into Cache 2 at index 110 with a word offset of 10, and the upper 25 bits of the address will be stored as the tag.

    The table below shows how the function is mapped into Cache 2.

    index  tag                              word 00             word 01             word 10             word 11
    000    1001 1000 1111 0000 0000 1111 1  <instruction r>     <instruction s>     <instruction t>     lw $s0, 0($sp)
    001    1001 1000 1111 0000 0000 1111 1  lw $s1, 4($sp)      lw $s2, 8($sp)      lw $ra, 12($sp)     addi $sp, $sp, 16
    010    1001 1000 1111 0000 0000 1111 1  jr $ra
    011
    100
    101
    110    1001 1000 1111 0000 0000 1111 0                                          addi $sp, $sp, -16  sw $s0, 0($sp)
    111    1001 1000 1111 0000 0000 1111 0  sw $s1, 4($sp)      sw $s2, 8($sp)      sw $ra, 12($sp)     <instruction q>

    In the first iteration of the loop, 1 compulsory miss and 2 hits occur for the instructions before the loop. Then 2 compulsory misses and 3 hits occur for the body of the loop up through the jal instruction. When the function is called, 5 compulsory misses and 10 hits occur. The instructions for the function replace the instructions for the loop that were in the cache. When the function returns, 3 misses and 8 hits occur in the remainder of the loop. The instructions for the loop replace most of the instructions for the function that were in the cache.

    The table below shows the contents of the cache after one iteration of the loop.

    index  tag                              word 00             word 01             word 10             word 11
    000    1001 1000 1111 0000 0000 1100 1  <instruction c>     <instruction d>     <instruction e>     <instruction f>
    001    1001 1000 1111 0000 0000 1100 1  <instruction g>     <instruction h>     addi $s5, $s5, 1    bne $s5, $s6, loop
    010    1001 1000 1111 0000 0000 1111 1  jr $ra
    011
    100
    101
    110    1001 1000 1111 0000 0000 1111 0  add $a0, $s0, 0     add $a1, $s1, 0     addi $sp, $sp, -16  sw $s0, 0($sp)
    111    1001 1000 1111 0000 0000 1100 0  jal function        add $s4, $s4, $v0   <instruction a>     <instruction b>

    In the first iteration there are 11 misses and 23 hits, including the 3 instructions before the loop. Because the jal and jr instructions remain in the cache after the first iteration, they will be hits instead of misses in succeeding iterations. There will be 8 misses and 23 hits in each of the next 1,000,000 iterations. With 7 clock cycles per miss and 1 clock cycle per hit, the total time required will be (8,000,011 × 7) + (23,000,023 × 1) = 79,000,100 clock cycles.
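
    The miss and hit counts for the loop with the function call can be checked by simulating the fetch order. This is a sketch under stated assumptions: the instruction counts (16 in the loop, 15 in the function, jal as the 5th loop instruction) are read off the mapping tables above, the 3 pre-loop instructions are assumed to sit immediately before the loop, Cache 2 is taken to have 8 lines of 4 words, and run_direct_mapped is a hypothetical helper.

```python
def run_direct_mapped(trace, num_lines, words_per_line, lines=None):
    """Feed byte addresses through a direct-mapped cache; return (misses, hits, lines)."""
    lines = {} if lines is None else lines      # index -> tag currently stored
    misses = hits = 0
    for addr in trace:
        block = (addr >> 2) // words_per_line
        index, tag = block % num_lines, block // num_lines
        if lines.get(index) == tag:
            hits += 1
        else:
            misses += 1
            lines[index] = tag
    return misses, hits, lines


LOOP_START = 0x98F00C60       # add $a0, $s0, 0
FUNC_START = 0x98F00F68       # addi $sp, $sp, -16
LOOP_LEN, FUNC_LEN = 16, 15   # instruction counts read off the tables
CALL_AT = 5                   # jal function is the 5th loop instruction
MISS_CYCLES, HIT_CYCLES = 7, 1
REPEATS = 1_000_000           # iterations after the first

prologue = [LOOP_START - 4 * i for i in (3, 2, 1)]   # assumed to precede the loop
loop = [LOOP_START + 4 * i for i in range(LOOP_LEN)]
func = [FUNC_START + 4 * i for i in range(FUNC_LEN)]
iteration = loop[:CALL_AT] + func + loop[CALL_AT:]   # fetch order of one pass

first_m, first_h, state = run_direct_mapped(prologue + iteration, 8, 4)
later_m, later_h, state = run_direct_mapped(iteration, 8, 4, state)

total_misses = first_m + REPEATS * later_m
total_hits = first_h + REPEATS * later_h
total_cycles = total_misses * MISS_CYCLES + total_hits * HIT_CYCLES
print(total_misses, total_hits, total_cycles)
```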

  2. The first instruction of the loop, add $a0, $s0, 0, has a memory address of 1001 1000 1111 0000 0000 1100 0110 0000, so it will be loaded into Cache 3 at index 10 with a word offset of 00, and the upper 26 bits of the address will be stored as the tag. The first instruction of the function, addi $sp, $sp, -16, has a memory address of 1001 1000 1111 0000 0000 1111 0110 1000, so it will be loaded into Cache 3 at index 10 with a word offset of 10, and the upper 26 bits of the address will be stored as the tag. Because Cache 3 is 2-way set associative, there are 2 lines available for each index. Although instructions from the loop and instructions from the function map to the same index, they can be placed in separate lines in the cache.

    When the last instruction of the function, jr $ra, is loaded into the cache at index 10, both lines are already occupied, and the tags are different from the tag for the jr instruction. In this case the LRU (least recently used) policy determines which line to replace. The jr instruction (and the 3 instructions that come after it in memory) are loaded into the first line, replacing several instructions from the loop.

    The table below shows the contents of the cache after one iteration, with two lines listed for each index.

    index  tag                               word 00             word 01             word 10             word 11
    00     1001 1000 1111 0000 0000 1111 10  <instruction r>     <instruction s>     <instruction t>     lw $s0, 0($sp)
           1001 1000 1111 0000 0000 1100 10  <instruction c>     <instruction d>     <instruction e>     <instruction f>
    01     1001 1000 1111 0000 0000 1111 10  lw $s1, 4($sp)      lw $s2, 8($sp)      lw $ra, 12($sp)     addi $sp, $sp, 16
           1001 1000 1111 0000 0000 1100 10  <instruction g>     <instruction h>     addi $s5, $s5, 1    bne $s5, $s6, loop
    10     1001 1000 1111 0000 0000 1111 10  jr $ra              add $a1, $s1, 0     add $a2, $s2, 0     add $a3, $s3, 0
           1001 1000 1111 0000 0000 1111 01                                          addi $sp, $sp, -16  sw $s0, 0($sp)
    11     1001 1000 1111 0000 0000 1100 01  jal function        add $s4, $s4, $v0   <instruction a>     <instruction b>
           1001 1000 1111 0000 0000 1111 01  sw $s1, 4($sp)      sw $s2, 8($sp)      sw $ra, 12($sp)     <instruction q>

    In the first iteration of the loop, 1 compulsory miss and 2 hits occur for the instructions before the loop. Then 2 compulsory misses and 3 hits occur for the body of the loop up through the jal instruction. When the function is called, 5 compulsory misses and 10 hits occur. The last instruction of the function replaces the first instructions of the loop that were in the cache. When the function returns, 2 compulsory misses and 9 hits occur in the remainder of the loop. Thus there are 10 misses and 24 hits in the first iteration, including the 3 instructions before the loop.

    In the following iterations, the first instructions of the loop will replace the first instructions of the function that were in the cache (following the LRU policy). The first instructions of the function will then replace the jr instruction from the end of the function that was in the cache (again following the LRU policy). Finally, the jr instruction at the end of the function replaces the first instructions of the loop (again following the LRU policy). Thus there are 3 misses and 28 hits in each of the next 1,000,000 iterations.

    There are 10 misses and 24 hits in the first iteration, including the 3 instructions before the loop. There are 3 misses and 28 hits in each of the next 1,000,000 iterations. With 7 clock cycles per miss and 1 clock cycle per hit, the total time required will be (3,000,010 × 7) + (28,000,024 × 1) = 49,000,094 clock cycles.
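
    The LRU behaviour described above can be checked with a small set-associative simulation. This is a sketch under stated assumptions: Cache 3 is taken to have 4 sets of 2 lines with 4 words per line (inferred from the 2-bit index and 26-bit tag), the instruction counts (16 in the loop, 15 in the function, jal as the 5th loop instruction) are read off the tables, the 3 pre-loop instructions are assumed to sit immediately before the loop, and run_lru is a hypothetical helper.

```python
def run_lru(trace, num_sets, ways, words_per_line, sets=None):
    """Feed byte addresses through a set-associative LRU cache.

    Each set is a list of tags ordered from least to most recently used.
    Returns (misses, hits, sets).
    """
    sets = {} if sets is None else sets
    misses = hits = 0
    for addr in trace:
        block = (addr >> 2) // words_per_line
        index, tag = block % num_sets, block // num_sets
        lines = sets.setdefault(index, [])
        if tag in lines:
            hits += 1
            lines.remove(tag)          # will be re-appended as most recent
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)           # evict the least recently used line
        lines.append(tag)
    return misses, hits, sets


LOOP_START = 0x98F00C60       # add $a0, $s0, 0
FUNC_START = 0x98F00F68       # addi $sp, $sp, -16
LOOP_LEN, FUNC_LEN = 16, 15   # instruction counts read off the tables
CALL_AT = 5                   # jal function is the 5th loop instruction
MISS_CYCLES, HIT_CYCLES = 7, 1
REPEATS = 1_000_000           # iterations after the first

prologue = [LOOP_START - 4 * i for i in (3, 2, 1)]   # assumed to precede the loop
loop = [LOOP_START + 4 * i for i in range(LOOP_LEN)]
func = [FUNC_START + 4 * i for i in range(FUNC_LEN)]
iteration = loop[:CALL_AT] + func + loop[CALL_AT:]

first_m, first_h, state = run_lru(prologue + iteration, num_sets=4, ways=2, words_per_line=4)
later_m, later_h, state = run_lru(iteration, num_sets=4, ways=2, words_per_line=4, sets=state)

total_misses = first_m + REPEATS * later_m
total_hits = first_h + REPEATS * later_h
total_cycles = total_misses * MISS_CYCLES + total_hits * HIT_CYCLES
print(total_misses, total_hits, total_cycles)
```

    Keeping each set as a short list ordered by recency is enough for 2 ways; a larger associativity would work the same way, only with longer lists.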

  3. The set associative cache, Cache 3, is faster by a factor of:
    79,000,100 / (1.10 × 49,000,094) = 1.47