From b283d976996ed54bab96e36452ef0260b6d7d7a8 Mon Sep 17 00:00:00 2001
From: Carl Pearson <pearson@illinois.edu>
Date: Fri, 1 Feb 2019 13:34:40 -0600
Subject: [PATCH] update sgemm readme

---
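Note on the matrix layouts described in the new Procedure text: the block below is a minimal host-side C++ sketch, not the lab solution. It models the column-major first input, the row-major second input, the column-major output, and a per-thread accumulator array standing in for the register tile. The macro definitions, `TILE_N`, and `sgemm_host_sketch` are hypothetical names chosen for illustration; the macros actually provided in `template.cu` may be defined differently. Shared-memory tiling of the second matrix has no host equivalent and is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical index helpers; the macros shipped in template.cu may differ.
#define A(row, col) A[(row) + (col) * m]  // column-major, m x k
#define B(row, col) B[(row) * n + (col)]  // row-major, k x n
#define C(row, col) C[(row) + (col) * m]  // column-major, m x n

// Number of outputs computed per "thread" (illustrative coarsening factor).
constexpr std::size_t TILE_N = 4;

// Host model: each iteration of the `row` loop plays the role of one GPU thread
// that computes TILE_N consecutive outputs in one row of C.
void sgemm_host_sketch(std::size_t m, std::size_t n, std::size_t k,
                       const std::vector<float>& A,
                       const std::vector<float>& B,
                       std::vector<float>& C) {
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    for (std::size_t col0 = 0; col0 < n; col0 += TILE_N) {
        for (std::size_t row = 0; row < m; ++row) {
            float acc[TILE_N] = {0.0f};  // output tile held in "registers"
            for (std::size_t i = 0; i < k; ++i) {
                const float a = A(row, i);  // loaded once, reused up to TILE_N times
                for (std::size_t j = 0; j < TILE_N && col0 + j < n; ++j) {
                    acc[j] += a * B(i, col0 + j);
                }
            }
            // Write the finished tile back, guarding the right-hand boundary.
            for (std::size_t j = 0; j < TILE_N && col0 + j < n; ++j) {
                C(row, col0 + j) = acc[j];
            }
        }
    }
}

#undef A
#undef B
#undef C
```

Loading each element of the first matrix once and reusing it across `TILE_N` outputs is the reuse that thread coarsening with register tiling aims for. In a real kernel the `row` loop would be replaced by the thread index, and the elements of the second matrix would be staged through shared memory.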
 labs/sgemm-regtiled-coarsened/README.md | 41 +++++--------------------
 1 file changed, 8 insertions(+), 33 deletions(-)

diff --git a/labs/sgemm-regtiled-coarsened/README.md b/labs/sgemm-regtiled-coarsened/README.md
index 220712f..d88400e 100644
--- a/labs/sgemm-regtiled-coarsened/README.md
+++ b/labs/sgemm-regtiled-coarsened/README.md
@@ -1,24 +1,13 @@
-# 7-point Stencil with Thread-coarsening and Register Tiling
+# Matrix Multiplication with Thread Coarsening and Register Tiling
 
-## Objective
-The purpose of this lab is to practice the thread coarsening and register tiling optimization techniques using 7-point stencil as an example.
+## Objective
+The purpose of this lab is to practice the thread coarsening and register tiling optimization techniques using matrix-matrix multiplication as an example.
 
-## Procedure
-1. Edit the `kernel` function in `template.cu` to implement a 7-point stencil (refer to the [lecture slides](https://bw-course.ncsa.illinois.edu/mod/resource/view.php?id=574)) with combined register tiling and x-y shared memory tiling, and thread coarsening along the z-dimension.
-
-    ```
-    out(i, j, k) =  C0 *in(i, j, k)
-                  + C1 * (  in(i-1, j, k)
-                          + in(i, j-1, k)
-                          + in(i, j, k-1)
-                          + in(i+1, j, k)
-                          + in(i, j+1, k)
-                          + in(i, j, k+1) )
-    ```
-
-2. Edit the `launchStencil` function in `template.cu` to launch the kernel you implemented. The function should launch 2D CUDA grid and blocks, where each thread is responsible for computing an entire column in the z-deminsion.
-
-    `A0` and `Anext` in the code template correspond to `in` and `out`, respectively. The output dimension of the 7-point stencil computation is one smaller than the input dimension on both sides for all boundaries (e.g., output dimension is 6x6x6 for an input of 8x8x8). Only those "internal" elements needs to be calculated.
+## Procedure
+1. [Instructions on how to retrieve the new lab package.]
+
+2. Edit `template.cu` to implement and launch a matrix-matrix multiplication kernel that uses the thread coarsening and register tiling optimization techniques.
+   The first input matrix is stored in column-major order and is tiled in registers; the second input matrix is stored in row-major order and is tiled in shared memory; the output matrix is stored in column-major order and is tiled in registers. Macros are provided to simplify indexing into these matrices.
 
 3. Test your code using rai
 
@@ -27,17 +16,3 @@ The purpose of this lab is to practice the thread coarsening and register tiling
     Be sure to add any additional flags that are required by your course (`--queue` or others).
 
 4. Submit your code on rai
-
-## Other notes
-
-To simplify the kernel code, you do not need to support input data with z-extent less than 2.
-
-The data is stored in column-major order. For example, you might consider using a macro to simplify your data access indexing:
-
-```c++
-__global__ void kernel(...) {}
-    #define A0(i, j, k) A0[((k)*ny + (j))*nx + (i)]
-    // your kernel code
-    #undef A0
-}
-```