COSC 201: Computer Organization

Lab 4: Synchronization of parallel threads in Assembly Language

Purpose
Learn how to use the ll (load-linked) and sc (store-conditional) instructions to coordinate parallel threads that share memory.
Method
Write assembly code to modify a sequential program to work in parallel.
Preparation
Read Chapter 2 in the text.
Files to Use
        sequential code: MatMultDriver.s, FPMatMult.s
        parallel code start: MatMultDriverPar4x4.s, FPMatMulParallel4x4.s
        general parallel code start: MatMultDriverPar.s, FPMatMulParallel.s

What to Hand In
Completed FPMatMultParallel4x4XX.s, FPMatMulParallelXX.s and parallelMatMultXX.txt (XX are your initials or name)

1. The sequential code files carry out a computation for 4x4 matrices of doubles: X = X + Y*Z. The code uses three nested loops: the outer two run over the indices i and j of the X matrix entry being computed, and the inner loop completes the computation of the new value of X(i, j). Run the code a step at a time.
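For reference, the three nested loops described above can be sketched in C. This is only an illustration of the computation, not a translation of FPMatMult.s: the function name mat_mult_seq is made up here, and the sketch assumes the matrices are stored row by row as flat arrays of 16 doubles, which the assembly files may or may not do.

```c
#include <assert.h>

#define N 4  /* matrix dimension used in this part of the lab */

/* Computes X = X + Y*Z for N x N matrices of doubles.
 * Assumption: each matrix is a flat array stored row by row,
 * so entry (i, j) lives at index i*N + j. */
void mat_mult_seq(double *x, const double *y, const double *z)
{
    assert(x && y && z);
    for (int i = 0; i < N; i++) {           /* row of X */
        for (int j = 0; j < N; j++) {       /* column of X */
            double sum = x[i * N + j];      /* start from the old X(i, j) */
            for (int k = 0; k < N; k++) {   /* inner loop over k */
                sum += y[i * N + k] * z[k * N + j];
            }
            x[i * N + j] = sum;
        }
    }
}
```

The two outer loops only choose which entry of X to compute; all of the arithmetic for that entry happens in the inner k loop.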

Your job is to change this to a parallel computation that could run on two, four, or more cores or threads.

It is easiest to do this in two steps. The first step is to replace the nested loops over i and j, each of which runs from 0 to 3, with a single loop whose index, job, runs from 0 to 15. For each value of job, read the j index from the two low-order bits of job and the i index from the next two bits. (Hint: use shift and mask operations.)
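The bit layout in the hint can be checked with a small C sketch. The helper names job_row and job_col are hypothetical and exist only for this illustration; in the lab itself the same shift and mask would be done with MIPS instructions.

```c
#include <assert.h>

/* Row index i: the two bits just above the low-order two (bits 3..2 of job). */
unsigned job_row(unsigned job)
{
    assert(job < 16);
    return (job >> 2) & 0x3;   /* shift right by 2, then keep two bits */
}

/* Column index j: the two low-order bits (bits 1..0 of job). */
unsigned job_col(unsigned job)
{
    assert(job < 16);
    return job & 0x3;          /* keep two bits, no shift needed */
}
```

For example, job 7 is 0111 in binary, so i = 1 and j = 3. In general job = 4*i + j, so running job from 0 to 15 visits the entries of X in the same order as the original nested loops.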

The second step is to turn this into parallel code. The driver code adds a word in memory, labeled count and initially containing 0, immediately after the matrices, and passes the address of this word to the function as an additional parameter in $a3. In the matrix multiplication function, replace the outer loop over the job index with code that reads the value of the job index from this memory location. Do this with the load-linked and store-conditional instructions so as to avoid a race condition (think fetch-and-increment): different processors must always get different indices, and every job index must be done by some processor. Any processor that gets a job value greater than 15 should terminate. Otherwise, the inner loop should run as specified in the sequential code. If your code is correct, it should run and give the same results as the sequential code, even with the ll and sc instructions in place.
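The fetch-and-increment idea can be sketched in C11 as a retry loop. This is an analogy under stated assumptions, not the lab solution: C has no ll and sc, so the sketch substitutes atomic_compare_exchange_weak, which behaves similarly for this purpose but is not identical (sc fails after any intervening write to the location, while compare-exchange only checks that the value is unchanged). The names fetch_job, run_worker and the done array are made up for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>

#define NUM_JOBS 16   /* job indices 0..15 for the 4x4 case */

/* Atomically returns the current value of *count and adds 1 to it. */
int fetch_job(atomic_int *count)
{
    int seen = atomic_load(count);   /* plays the role of ll */
    /* Plays the role of sc: store seen + 1 only if *count still equals seen.
     * On failure, seen is refreshed with the current value and we retry. */
    while (!atomic_compare_exchange_weak(count, &seen, seen + 1)) {
        /* another thread got in first; try again */
    }
    return seen;
}

/* Takes jobs until the shared counter passes the last job.
 * done[job] is incremented where the real code would run the inner k loop.
 * Returns how many jobs this worker ran. */
int run_worker(atomic_int *count, int *done)
{
    int ran = 0;
    for (;;) {
        int job = fetch_job(count);
        assert(job >= 0);
        if (job >= NUM_JOBS) {       /* job value greater than 15: terminate */
            break;
        }
        done[job] += 1;
        ran++;
    }
    return ran;
}
```

Because the read and the increment succeed or fail together, two workers can never both walk away with the same job value, and no value is skipped.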


2. Write a text (or Word) file, parallelMatMultXX, that describes the following:
Run your parallel code a step at a time and observe how the job count is loaded into $t3. Then keep running the code, but while the multiplication routine is running (a step at a time), double-click the memory location where the job count is stored and change it to a higher value. The next time your code fetches a job, what is it? If you change the value in the memory location between your load-linked and store-conditional instructions, does this affect the order of execution (as it should if another processor accessed the memory)? Why or why not?


3. Modify FPMatMulParallel.s so that it does the calculation for any matrices whose dimension (row and column length) is a power of two, e.g. 4, 8, 16, 32, etc. Assume that a fifth parameter is passed from the driver, MatMultDriverPar.s, to the function in $v0. (Note: although the a-registers are designated for arguments in the MIPS standard, there is no reason we cannot use $v0 and $v1 in the same way, since doing so does not keep them from being used to return values at the end of the function.) You will need to use the information given in $v0 to determine several values: the length of a row, for the inner k loop; the total number of jobs, to determine when the computation is done; and the mask value and shift amount for pulling the i and j values out of the job-count value read from memory. Note: you will need the shift instruction sllv r1, r2, r3, which shifts the bit string in r2 left by the value in r3 (which must be less than 32) and stores the result in r1. There is a similar srlv instruction. You can test your FPMatMulParallel.s code with MatMultDriverPar.s, varying the size of the matrices by changing the value placed into $v0, up to 4.
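As a rough guide to how those values relate, here is a C sketch. It rests on an assumption this handout does not state: that the fifth parameter is the exponent p, so that the row length is 2^p. Check MatMultDriverPar.s for what is actually placed in $v0. The struct and function names are hypothetical.

```c
#include <assert.h>

/* Values the general routine needs, all derived from one parameter. */
struct job_params {
    unsigned n;         /* row and column length */
    unsigned num_jobs;  /* n * n entries of X to compute */
    unsigned mask;      /* n - 1: keeps the low p bits */
    unsigned shift;     /* p: how far to shift to reach the i bits */
};

/* Assumption: p is log2 of the matrix dimension. */
struct job_params derive_params(unsigned p)
{
    struct job_params jp;
    assert(p >= 1 && p < 16);     /* keeps 2*p below 32 */
    jp.shift = p;
    jp.n = 1u << p;               /* variable shift, like sllv */
    jp.num_jobs = 1u << (2 * p);
    jp.mask = jp.n - 1;
    return jp;
}

/* Row index i: the p bits above the low-order p bits of job. */
unsigned job_row_n(const struct job_params *jp, unsigned job)
{
    return (job >> jp->shift) & jp->mask;   /* variable shift, like srlv */
}

/* Column index j: the low-order p bits of job. */
unsigned job_col_n(const struct job_params *jp, unsigned job)
{
    return job & jp->mask;
}
```

With p = 2 this reduces to the 4x4 case: 16 jobs, a mask of 3, and a shift of 2. With p = 3 there are 64 jobs and job 29 maps to i = 3, j = 5.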


Hand in your files FPMatMultParallel4x4XX.s, FPMatMulParallelXX.s and parallelMatMultXX.txt (XX are your initials or name) by emailing them to the instructor.