COSC 201: Computer Organization

Lab 4: Synchronization of parallel threads in Assembly Language

Purpose
Learn how to use the ll (load-linked) and sc (store-conditional) instructions to coordinate parallel threads that share memory.
Method
Write assembly code to modify a sequential program to work in parallel.
Preparation
Read Chapter 2 in the text.
Files to Use
sequential code: MatMultDriver.s, FPMatMult.s
parallel code start: MatMultDriverPar.s, FPMatMulParallel.s
What to Hand In
Completed FPMatMultParallelXX.s and generalMatMultXX.txt (XX are your initials or name)

The sequential code files carry out a computation for 4x4 matrices of doubles: X = X + Y*Z. This code uses three nested loops: the outer two run over the indices i and j of the X entry being computed, and the inner loop completes the computation of the new value of X(i, j). Run the code a step at a time to follow the computation.
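For reference, the loop structure described above can be sketched in C. This is only an illustration, not the contents of FPMatMult.s; the function name mat_mult_add and the row-major layout are assumptions made for the sketch.

```c
#define N 4  /* matrix dimension used in this lab */

/* Computes X = X + Y*Z for N x N matrices of doubles.
   The name and the row-major layout are assumptions for illustration. */
void mat_mult_add(double X[N][N], double Y[N][N], double Z[N][N])
{
    for (int i = 0; i < N; i++) {          /* row index of X */
        for (int j = 0; j < N; j++) {      /* column index of X */
            double sum = X[i][j];          /* start from the old X(i, j) */
            for (int k = 0; k < N; k++) {  /* inner loop: row of Y times column of Z */
                sum += Y[i][k] * Z[k][j];
            }
            X[i][j] = sum;
        }
    }
}
```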

Your job is to change this to a parallel computation that could run on two, four, or more cores or threads.

It is easiest to do this in two steps. The first step is to replace the nested loops for i and j, each of which runs from 0 to 3, with a single loop whose index, job, runs from 0 to 15. For each value of job, take the j index from the two low-order bits of job and the i index from the next two bits. (Hint: use shift and mask operations.)
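The index decoding in this first step might look like the following in C. It is a sketch of the idea only; your submission must do the same thing in assembly. The names do_job and mat_mult_add_jobs are hypothetical.

```c
#define N 4  /* matrix dimension used in this lab */

/* Hypothetical helper: updates the single entry X(i, j) selected by job. */
static void do_job(unsigned job, double X[N][N], double Y[N][N], double Z[N][N])
{
    unsigned j = job & 0x3;         /* two low-order bits: column index */
    unsigned i = (job >> 2) & 0x3;  /* next two bits: row index */

    double sum = X[i][j];
    for (unsigned k = 0; k < N; k++) {  /* inner loop is unchanged */
        sum += Y[i][k] * Z[k][j];
    }
    X[i][j] = sum;
}

/* One loop over job = 0..15 replaces the nested i and j loops. */
void mat_mult_add_jobs(double X[N][N], double Y[N][N], double Z[N][N])
{
    for (unsigned job = 0; job < N * N; job++) {
        do_job(job, X, Y, Z);
    }
}
```

For example, job = 9 is binary 1001, so j = 1 (bits 01) and i = 2 (bits 10).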

The second step is to turn this into parallel code. In the driver code, add a memory location labeled count immediately after the matrices, initially containing a word with value 0, and pass its address to the matrix multiplication function as an additional parameter. In that function, instead of an outer loop that runs over the job index, read the job index from count and increment count. Do this with the load-linked and store-conditional instructions so as to avoid a race condition (think fetch-and-increment): different processors must always get different indices, and every job index must be done by some processor. Any processor that gets a job value greater than 15 should terminate; otherwise, it should run the inner loop as specified in the sequential code. If your code is correct, it should run and give the same results as the sequential code, even though it now uses the ll and sc instructions.
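The fetch-and-increment idea can be sketched in C11 with a compare-and-exchange retry loop. This is an analogue rather than a translation: compare-and-exchange is not identical to ll/sc, but the retry pattern (read, compute the new value, try to store, repeat on failure) is the same. The names fetch_job and count_jobs are hypothetical.

```c
#include <stdatomic.h>

/* Atomically returns the current job index and advances the shared counter. */
int fetch_job(atomic_int *count)
{
    int old = atomic_load(count);  /* plays the role of ll */

    /* Plays the role of sc: the store succeeds only if *count still holds old.
       On failure, old is refreshed with the current value and we try again. */
    while (!atomic_compare_exchange_weak(count, &old, old + 1)) {
        /* another thread changed count in between: retry */
    }
    return old;
}

/* A worker keeps fetching jobs until it gets an index greater than 15.
   Returns how many jobs this worker claimed. */
int count_jobs(atomic_int *count)
{
    int claimed = 0;
    for (;;) {
        int job = fetch_job(count);
        if (job > 15) {
            break;      /* no work left: terminate */
        }
        claimed++;      /* the inner loop for this job would run here */
    }
    return claimed;
}
```

With a single worker starting from count = 0, all 16 jobs are claimed by that worker. With several workers sharing the same counter, each job index is handed to exactly one of them.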

Write a text (or Word) file, generalMatMultXX, that addresses two things:

  1. Run your parallel code a step at a time and observe how the job count is loaded into $t3. Then keep running the code, and while the multiplication routine is running (a step at a time), double-click the memory location where the job count is stored and change it to a higher value. What job does your code fetch next? If you change the value in the memory location between your load-linked and store-conditional instructions, does this affect the order of execution (as it should if another processor had accessed the memory)?
  2. Describe how you would need to change the code so that it could multiply matrices whose dimensions are any power of two (e.g., 1024x1024 matrices).

Hand in your file FPMatMultParallelXX.s and your file generalMatMultXX.txt (XX are your initials or name) by emailing them to the instructor.