COSC 201: Computer Organization

Lab 10: Data Caches

Purpose
Learn how caches affect program performance.
Method
Work in teams of 2 to analyze the effects of different cache organizations. We will use MIPS code and the MARS MIPS simulator for this lab. Although the code is in a different assembly language, you should be able to see how it works. The matrix multiplication in MatMult.s uses the standard three-loop algorithm for multiplying two matrices. The version in MatMult3a.s uses block multiplication (see pages 427-430 in the text).
Preparation
Read chapter 5 in the text.
Files to Use
version1: MatMult.s, MatMultDriver.s
version2: MatMult.s, MatMultDriver2.s
version3: MatMult.s, MatMultDriver3.s
version3a: MatMult3a.s, MatMultDriver3.s
(note: files with the same name are the same)
All files in a zip file
Cache diagrams: use these to map cache usage, and make as many copies as you need. (Word version of diagrams)
What to Hand In
Each team must turn in a written report. The report should explain how the different types of data cache affect performance, and how alternate program organization can take advantage of the cache characteristics.

The basic program is simple matrix multiplication: Z = X * Y, where X and Y are square matrices. Different versions of MatMultDriver have different memory layouts or different matrix sizes. MatMult uses the standard three-loop matrix multiplication algorithm, whereas MatMult3a uses a block multiplication algorithm.
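For readers who find the MIPS hard to follow at first, the three-loop algorithm can be sketched in Python. This is only an illustration: the function name mat_mult and the i, j, k loop order are assumptions, and the loop order in MatMult.s may differ.

```python
def mat_mult(x, y):
    """Three-loop multiplication of two square matrices given as lists of rows."""
    n = len(x)
    z = [[0] * n for _ in range(n)]
    for i in range(n):          # row of X and of Z
        for j in range(n):      # column of Y and of Z
            for k in range(n):  # walks along row i of X and down column j of Y
                z[i][j] += x[i][k] * y[k][j]
    return z


print(mat_mult([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Note that the innermost loop reads X along a row (consecutive words in memory if the matrix is stored row by row) but reads Y down a column (words that are n entries apart). That difference in access pattern is what the cache experiments below expose.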


Version 1

The matrices to be multiplied in this version are 8x8 matrices of integers, stored contiguously in memory (as you can see from the data portion of MatMultDriver.s).

In the MARS simulator select the Data Cache Simulator from the tools menu. For the first run of the program, set the cache parameters as follows:

        Placement policy: Direct Mapping
        Number of blocks: 256
        Cache Block size(words): 1

Load the MatMult.s program from the version 1 file. Set break points on the line just before the jal MatMult line and on the line just after it. Assemble the files (assemble all in directory, no delayed branch). Run the simulation to the first break point, then click to connect the data cache simulator. Run to the second break point. Disconnect the data cache simulator. Run to the end. Note: we use the break points so that we count cache performance only for the multiplication, not for the printing of the matrices.

Stretch the Data Cache Simulator window up and down, if necessary, to see the cache hit and miss statistics. Record these. In your report, explain the observed miss percentage by describing how the data values are stored and located in the cache.
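When mapping data values onto the cache diagrams, it can help to compute by hand where a given address lands. The following Python sketch splits a byte address into tag, index, and word offset for a direct-mapped cache. The function name is hypothetical, and the base address 0x10010000 is an assumption (it is the usual start of the MARS data segment); check the data segment window for the addresses your run actually uses.

```python
WORD_BYTES = 4  # MIPS words are 4 bytes


def direct_mapped_slot(byte_addr, num_blocks, words_per_block):
    """Return (tag, index, word_offset) for a direct-mapped cache."""
    word_addr = byte_addr // WORD_BYTES
    block_addr = word_addr // words_per_block  # which memory block holds the word
    index = block_addr % num_blocks            # cache line the block must use
    tag = block_addr // num_blocks             # identifies which block is in that line
    word_offset = word_addr % words_per_block  # position of the word inside the block
    return tag, index, word_offset


BASE = 0x10010000  # assumed start of the data segment

# 256 blocks of 1 word: consecutive words land in consecutive cache lines.
print(direct_mapped_slot(BASE, 256, 1))      # (262208, 0, 0)
print(direct_mapped_slot(BASE + 4, 256, 1))  # (262208, 1, 0)

# 64 blocks of 4 words: the second word shares line 0 with the first.
print(direct_mapped_slot(BASE + 4, 64, 4))   # (262208, 0, 1)
```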

After this first run, change the cache simulator settings as follows:

        Placement policy: Direct Mapping
        Number of blocks: 64
        Cache Block size(words): 4

Follow the same procedure as above. In your report, explain the improved cache performance that you observe. Be precise about how the data values are stored and located in the cache.
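To see the effect of block size in isolation, here is a minimal direct-mapped cache model in Python, run on a simple sequential scan of 64 words rather than on the real matrix-multiplication trace. The function name and the base address are assumptions for illustration; the numbers it prints describe this toy scan only and are not the statistics MARS will report for MatMult.s.

```python
def simulate_direct_mapped(byte_addrs, num_blocks, words_per_block):
    """Count (hits, misses) for a sequence of word accesses given as byte addresses."""
    tags = [None] * num_blocks  # tag currently stored in each cache line
    hits = misses = 0
    for addr in byte_addrs:
        block_addr = (addr // 4) // words_per_block
        index = block_addr % num_blocks
        tag = block_addr // num_blocks
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag  # load the whole block into the line
    return hits, misses


BASE = 0x10010000  # assumed start of the data segment
scan = [BASE + 4 * w for w in range(64)]  # 64 consecutive words, each read once

print(simulate_direct_mapped(scan, 256, 1))  # (0, 64)
print(simulate_direct_mapped(scan, 64, 4))   # (48, 16)
```

With 1-word blocks, every first touch of a word is a miss. With 4-word blocks, one miss brings in three neighboring words, so the next three sequential accesses hit.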


Version 2

Follow the same procedures as in version 1, but use the files in the version 2 folder (MatMult.s and MatMultDriver2.s). If you look at the layout of the data in memory for this version, you will see that the matrices are the same as for version 1 but are spaced out in memory.

For the first run in this version, use the cache simulator settings as follows:

        Placement policy: Direct Mapping
        Number of blocks: 64
        Cache Block size(words): 4

In your report explain why the performance deteriorates as it does. Be precise about the cause of all the misses.
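One mechanism worth checking for is the conflict miss: two addresses that map to the same cache line evict each other even though the rest of the cache is empty. The Python sketch below shows this for a 64-block, 4-word direct-mapped cache. The base address and the 1024-byte spacing between X and Y are hypothetical values chosen to force a collision; read the real spacing from MatMultDriver2.s.

```python
CACHE_BLOCKS = 64
WORDS_PER_BLOCK = 4
CACHE_BYTES = CACHE_BLOCKS * WORDS_PER_BLOCK * 4  # 1024 bytes in total


def index_of(byte_addr):
    """Cache line used by this address in the direct-mapped cache."""
    return (byte_addr // (4 * WORDS_PER_BLOCK)) % CACHE_BLOCKS


def tag_of(byte_addr):
    """Tag stored alongside the block in that line."""
    return byte_addr // CACHE_BYTES


X_BASE = 0x10010000            # assumed address of X[0][0]
Y_BASE = X_BASE + CACHE_BYTES  # hypothetical: Y starts exactly one cache size later


def ping_pong_misses(rounds):
    """Alternate reads of X[0][0] and Y[0][0]; return the number of misses."""
    stored = {}  # line index -> tag currently held
    misses = 0
    for _ in range(rounds):
        for addr in (X_BASE, Y_BASE):
            idx, tag = index_of(addr), tag_of(addr)
            if stored.get(idx) != tag:
                misses += 1
                stored[idx] = tag  # evicts whatever was in the line
    return misses


print(index_of(X_BASE), index_of(Y_BASE))  # 0 0
print(ping_pong_misses(10))                # 20
```

Both addresses use line 0 but carry different tags, so all 20 accesses miss in this toy pattern.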

For the second run in this version, use a set-associative cache with the following settings:

        Placement policy: N-way Set Associative
        Set size (blocks): 4
        Number of blocks: 64
        Cache Block size(words): 4

In your report explain the improved performance. Again, be precise.
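The sketch below models an N-way set-associative cache in Python and replays a small access pattern in which two addresses compete for the same location. It assumes LRU replacement, and the function name, base address, and 1024-byte spacing are hypothetical; confirm the replacement policy selected in your simulator window and the real addresses in the driver file.

```python
from collections import OrderedDict


def simulate_set_associative(byte_addrs, num_blocks, words_per_block, ways):
    """Count (hits, misses) for an N-way set-associative cache with LRU replacement."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]  # each maps tag -> True, oldest first
    hits = misses = 0
    for addr in byte_addrs:
        block_addr = (addr // 4) // words_per_block
        cache_set = sets[block_addr % num_sets]
        tag = block_addr // num_sets
        if tag in cache_set:
            hits += 1
            cache_set.move_to_end(tag)         # mark as most recently used
        else:
            misses += 1
            if len(cache_set) == ways:
                cache_set.popitem(last=False)  # evict the least recently used block
            cache_set[tag] = True
    return hits, misses


X_ADDR = 0x10010000     # assumed address of one word of X
Y_ADDR = X_ADDR + 1024  # hypothetical word of Y that collides with it
trace = [X_ADDR, Y_ADDR] * 10

print(simulate_set_associative(trace, 64, 4, ways=1))  # (0, 20)
print(simulate_set_associative(trace, 64, 4, ways=4))  # (18, 2)
```

Setting ways=1 makes the model direct mapped, so the two blocks keep evicting each other. With 4 ways, both blocks fit in the same set and only the first access to each one misses.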


Version 3

The previous versions get very high performance when the right cache is used because the cache is large enough to hold all three matrices completely. In this version we use matrices that are four times larger, 16x16, as shown in MatMultDriver3.s. The multiplication program, MatMult.s, is still the same. Run the simulation as before, using the 4-way set associative version of the cache:

        Placement policy: N-way Set Associative
        Set size (blocks): 4
        Number of blocks: 64
        Cache Block size(words): 4

In your report, explain the level of performance of the cache that you observe. Map the hits and misses to the locations in the cache as well as you can.
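A quick capacity check helps frame this explanation. The Python sketch below compares the number of words needed to hold X, Y, and Z against the number of words in the cache; it assumes one 32-bit word per matrix entry, and the function name is hypothetical.

```python
def matrix_words(n, matrices=3):
    """Words needed to hold the given number of n x n integer matrices."""
    return matrices * n * n


cache_words = 64 * 4  # 64 blocks of 4 words = 256 words

for n in (8, 16):
    needed = matrix_words(n)
    print(n, needed, needed <= cache_words)
# 8 192 True
# 16 768 False
```

The 8x8 matrices need 192 words and fit in the 256-word cache; the 16x16 matrices need 768 words, three times the cache capacity, so blocks must be evicted and reloaded during the multiplication.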

Version 3a

In this version, the matrices are the same as in version 3, but we have altered the program to use block multiplication of the matrices. Think of each matrix as a 4x4 matrix of blocks, each of which is 4x4:

                                     B00 B01 B02 B03
                                     B10 B11 B12 B13
                                     B20 B21 B22 B23
                                     B30 B31 B32 B33

Each block of the X matrix, for example, holds the 4x4 integer entries in that part of the X matrix. The matrix multiplication then proceeds by blocks. For example, the 1,2-block of the Z matrix would be computed by:

                                   ZB12 = XB10*YB02 + XB11*YB12 + XB12*YB22 + XB13*YB32

where each multiplication is a 4x4 matrix multiplication. This algorithm is given in MatMult3a.s; you should review this code to see how the block multiplication works. Load this version into the simulator and follow the same test procedures as before using the 4-way set associative cache. You can compare the printed result matrix with the one from the previous version to verify that the multiplication gives the same result, just with the operations done in a different order.
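The block algorithm can be sketched in Python as follows. This is an illustration only: the function names, the loop order, and the test matrices are assumptions, it requires the matrix size to be a multiple of the block size, and MatMult3a.s may organize its loops differently.

```python
def mat_mult(x, y):
    """Reference three-loop multiplication."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def block_mat_mult(x, y, b):
    """Multiply n x n matrices by b x b blocks (n must be a multiple of b)."""
    n = len(x)
    z = [[0] * n for _ in range(n)]
    for bi in range(0, n, b):          # block row of Z
        for bj in range(0, n, b):      # block column of Z
            for bk in range(0, n, b):  # Z block += X block (bi,bk) * Y block (bk,bj)
                for i in range(bi, bi + b):
                    for j in range(bj, bj + b):
                        total = z[i][j]
                        for k in range(bk, bk + b):
                            total += x[i][k] * y[k][j]
                        z[i][j] = total
    return z


n = 16
x = [[i + j for j in range(n)] for i in range(n)]  # arbitrary test data
y = [[i - j for j in range(n)] for i in range(n)]
print(block_mat_mult(x, y, 4) == mat_mult(x, y))   # True
```

Each pass of the three inner loops touches only one 4x4 block from each matrix, 48 words in total, and reuses those words several times before moving on to the next blocks.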

In your report, explain the improved cache performance.


Report

Your report should explain the performance observed for each cache configuration. Explain exactly where in the cache the information is stored and why this produces the observed numbers of cache hits and misses. Do this as precisely as you can for each version.

Specifically, you should discuss the advantages of 4 words per line (block) in the cache, versus 1 word per line. How would more words per line impact performance?

You should also address the advantages of a set associative cache, versus a direct mapped cache. Would 2-way set associative work as well as 4-way set associative for the programs discussed here?

You should discuss how reorganizing the data-access pattern in the program can affect cache performance. Will this technique scale up to even larger matrices? Could reorganizing the layout of data in memory also be used to improve performance?

We have seen in an earlier lab how this type of program can be written to run on multiple processors. What does your investigation of the performance of data caches suggest about how we might organize a parallel version of the matrix multiplication running on 4 processors, for example, to optimize cache performance?