This is the skeleton code for the Fall 2019 ECE408 / CS483 / CSE408 course project. In this project, you will:
- Get practical experience by using, profiling, and modifying MXNet, a standard open-source neural-network framework.
- Demonstrate command of CUDA and optimization approaches by designing and implementing an optimized neural-network convolution layer forward pass.
The project will be broken up into 4 milestones and a final submission. Read the description of the final report before starting, so you can collect the necessary information along the way. Each milestone (except Milestone 1) includes an updated report, culminating in the final report. Add each milestone's deliverables to the beginning of the document, so that your latest milestone appears first.
You will be working in teams of 3 (no exceptions). Chicago City Scholars may form teams with on-campus students.
You are expected to adhere to University of Illinois academic integrity standards. Do not attempt to subvert any of the performance-measurement aspects of the final project. If you are unsure whether something meets those guidelines, ask a member of the teaching staff.
- Milestone 1: Due 10/06/2019 @5pm
- Milestone 2: Due 10/12/2019 @5pm
- Milestone 3: Due 10/19/2019 @5pm
- Milestone 4: Due 11/21/2019 @5pm
- Final Submission: Due 12/19/2019 @5pm
- Rubric
- Final Report
- Extras
Due October 06 @ 5pm
| Deliverables |
|---|
| Register your team in the Google Sheet. |

You and your team should agree on a team name and enter it in this Google Sheet. Graduate students can use this Google Form.
Due October 12 @ 5pm
As with all milestones, you will include an updated PDF `report.pdf` in the project directory you submit with rai. This report should contain all of the deliverables, along with your names, NetIDs, rai IDs (if different), team name, and school affiliation (Chicago Scholars or on-campus).
| Deliverables |
|---|
| Report: Include a list of all kernels that collectively consume more than 90% of the program time. |
| Report: Include a list of all CUDA API calls that collectively consume more than 90% of the program time. |
| Report: Include an explanation of the difference between kernels and API calls. |
| Report: Show output of rai running MXNet on the CPU. |
| Report: List program run time. |
| Report: Show output of rai running MXNet on the GPU. |
| Report: List program run time. |
| Create a CPU implementation. |
| Report: List whole program execution time. |
| Report: List Op Times. |
| Use `rai -p <project folder> --queue rai_amd64_ece408 --submit=m2` to mark your job for grading. |
Clone this repository to get the project folder.

```bash
git clone https://github.com/illinois-impact/ece408_project.git
```

Download the rai binary for your platform from here. You will probably use it for development, and definitely use it for submission.

You should have received a `.rai_profile` file by email. Put that file in `~/.rai_profile` (Linux/macOS).
Your `.rai_profile` should look something like this (indented with spaces!):

```yaml
profile:
  firstname: <your-given-name>
  lastname: <your-surname>
  username: <your-username>
  email: <your-institution-email>
  access_key: <your-access-key>
  secret_key: <your-secret-key>
  affiliation: uiuc
```
You will need to add your team name in the following way:

```yaml
profile:
  firstname: <your-given-name>
  lastname: <your-surname>
  username: <your-username>
  email: <your-institution-email>
  access_key: <your-access-key>
  secret_key: <your-secret-key>
  affiliation: uiuc
  team:
    name: <your-team-name>
```
Some more info is available on the Client Documentation Page.
Run the built-in MXNet forward pass using rai.

Consult `m1.1.py` to examine the neural-network architecture used in this project. Use RAI to run a batch forward pass on some test data.

```bash
rai -p <project-folder> --queue rai_amd64_ece408
```

This will upload your project directory to rai (running on AWS) and move it to `/src`, where the execution specified in `rai_build.yml` will occur.
The `image:` key specifies the environment in which the rest of the execution will occur. This environment includes a prebuilt MXNet (so rai will only do a partial compile with your code) as well as the model definition and the training data. The `resources:` key specifies what computation resources will be available to the execution. The `commands:` key specifies the recipe that rai will execute. First, the project files are copied to the `/build` directory. Then the files in `ece408_src` are copied to `src/operator/custom/` in the MXNet source tree. MXNet is recompiled, and the Python bindings are installed. Finally, `python /src/m1.1.py` runs the `m1.1.py` Python program.
You should see the following output:

```
Loading fashion-mnist data... done
Loading model... done
New Inference
EvalMetric: {'accuracy': 0.8154}
```
Modify `rai_build.yml` to use `/usr/bin/time` to measure the elapsed time of the whole program:

```yaml
- /usr/bin/time python m1.1.py
```
Next, we will run on the GPU!

Compare `m1.2.py` and `m1.1.py`. You'll see that they are identical, except that `mx.gpu()` has been substituted for `mx.cpu()`. This is how we tell MXNet that we wish to use a GPU instead of a CPU. Modify `rai_build.yml` to time `python m1.2.py`, and again submit the job to rai:

```bash
rai -p <project-folder> --queue rai_amd64_ece408
```
Next, we will learn how to use `nvprof` to profile the execution.

Once you've gotten the appropriate accuracy results, generate a profile using nvprof. You will be able to use nvprof to evaluate how effective your optimizations are. As described above, make sure `rai_build.yml` is configured for a GPU run. Then, modify `rai_build.yml` to generate a profile instead of just executing the code:

```
nvprof python m1.2.py
```
You should see something that looks like the following:

```
==278== NVPROF is profiling process 278, command: python m1.2.py
Loading model... done
New Inference
EvalMetric: {'accuracy': 0.8154}
==15163== Profiling application: python m1.2.py
==15163== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   39.80%  16.602ms     20  830.11us  1.1200us  16.092ms  [CUDA memcpy HtoD]
                   20.28%  8.4577ms      1  8.4577ms  8.4577ms  8.4577ms  void cudnn::detail::implicit_convolve_sgemm
                   11.89%  4.9587ms      1  4.9587ms  4.9587ms  4.9587ms  volta_cgemm_64x32_tn
                    7.11%  2.9642ms      2  1.4821ms  25.760us  2.9384ms  void op_generic_tensor_kernel
...
      API calls:   42.14%  3.03300s     22  137.86ms  13.006us  1.56281s  cudaStreamCreateWithFlags
                   34.07%  2.45202s     24  102.17ms  117.07us  2.44545s  cudaMemGetInfo
                   21.32%  1.53449s     19  80.763ms     805ns  407.00ms  cudaFree
                    1.18%  84.772ms    912  92.951us     308ns  38.118ms  cudaFuncSetAttribute
                    0.47%  33.977ms      9  3.7753ms  33.322us  16.253ms  cudaMemcpy2DAsync
...
```
The GPU Activities section shows the kernels and memory transfers, and the API calls section shows the CUDA API calls that are executed. There are columns corresponding to percentage of time consumed, total time, number of calls, and average/min/max time of those calls. Think about the distinction between a CUDA API call and a kernel launch, and describe it briefly in your report. The CUDA documentation describes kernels and the programming interface.
You can find more information about nvprof in the CUDA Toolkit Documentation
See the description of the skeleton code for background information, including the data storage layout of the tensors.
Modify `ece408_src/new-forward.h` to implement the forward convolution described in Chapter 16 of the textbook.
The performance of the CPU convolution is not part of the project evaluation.
The algorithm is also given below, for your convenience:

```
for b = 0 .. B                      // for each image in the batch
    for m = 0 .. M                  // for each output feature map
        for h = 0 .. H_out          // for each output element
            for w = 0 .. W_out
            {
                y[b][m][h][w] = 0;
                for c = 0 .. C      // sum over all input feature maps
                    for p = 0 .. K  // KxK filter
                        for q = 0 .. K
                            y[b][m][h][w] += x[b][c][h + p][w + q] * k[m][c][p][q];
            }
```
Unlike the convolutions described in class, note that this one is not centered on the input image: output position (h, w) reads the input window whose top-left corner is (h, w).
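For reference, here is a minimal C++ sketch of this loop nest, assuming the tensors are flattened row-major `float` arrays (the skeleton's actual tensor types differ, so treat this as illustration only):

```cpp
// A minimal sketch of the CPU forward pass over flattened row-major arrays.
void conv_forward_cpu(float *y, const float *x, const float *k,
                      int B, int M, int C, int H, int W, int K) {
    const int H_out = H - K + 1;
    const int W_out = W - K + 1;
    // Index helpers for the flattened layouts described in this project.
    auto yi = [&](int b, int m, int h, int w) {
        return ((b * M + m) * H_out + h) * W_out + w;
    };
    auto xi = [&](int b, int c, int h, int w) {
        return ((b * C + c) * H + h) * W + w;
    };
    auto ki = [&](int m, int c, int p, int q) {
        return ((m * C + c) * K + p) * K + q;
    };
    for (int b = 0; b < B; ++b)                 // for each image in the batch
        for (int m = 0; m < M; ++m)             // for each output feature map
            for (int h = 0; h < H_out; ++h)     // for each output element
                for (int w = 0; w < W_out; ++w) {
                    float acc = 0.0f;
                    for (int c = 0; c < C; ++c) // sum over all input feature maps
                        for (int p = 0; p < K; ++p)
                            for (int q = 0; q < K; ++q)
                                acc += x[xi(b, c, h + p, w + q)] * k[ki(m, c, p, q)];
                    y[yi(b, m, h, w)] = acc;
                }
}
```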
Because this operator is different from the built-in MXNet operator, you will need to load a different model; `m2.1.py` handles this for you. Modify `rai_build.yml` to invoke

```
python m2.1.py
```
When your implementation is correct, you should see output like this:

```
Loading fashion-mnist data... done
Loading model... done
New Inference
Op Time: 10.906517
Op Time: 58.887046
Correctness: 0.7653 Model: ece408
```
Every time your layer is invoked, it will print the "Op Time," the time spent working on that layer. Since the network has two convolutional layers, two times will be printed. You can time the whole program execution by modifying `rai_build.yml` with

```
/usr/bin/time python m2.1.py
```

`m2.1.py` takes one optional argument: the dataset size.
If the correctness for each possible model matches the table below, you can be reasonably confident your implementation is right. The correctness does depend on the dataset size. For example, to check your correctness on the full data size of 10000, you could modify `rai_build.yml` to run

```
python m2.1.py 10000
```

| Model | Number of Images | Correctness |
|---|---|---|
| ece408 | 100 | 0.76 |
| ece408 | 1000 | 0.767 |
| ece408 | 10000 (default) | 0.7653 |
(The final model used for internal evaluation will be different.)
The provided `m2.1.py` is identical to the one used by `--submit=m2`. You may modify `m2.1.py` as you please, but check that `--submit=m2` will still invoke your code correctly.

Use

```
rai -p <project folder> --queue rai_amd64_ece408 --submit=m2
```

to mark your submission.
Due October 19 @ 5pm
| Deliverables |
|---|
| Everything from Milestone 2 |
| Implement a GPU convolution |
| Correctness and timing with 3 different dataset sizes |
| Report: demonstrate nvprof profiling the execution |
| Use `rai -p <project folder> --queue rai_amd64_ece408 --submit=m3` to mark your job for grading |
Modify `ece408_src/new-forward.cuh` to create a GPU implementation of the forward convolution. Modify `rai_build.yml` to run

```
python m3.1.py
```

to use your GPU implementation. When it is correct, it will show the same correctness as Milestone 2. A naive starting-point kernel sketch is shown below.
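This sketch assumes the flattened row-major layout of the CPU sketch above; the skeleton's actual `forward_kernel` signature and launch code may differ. One thread computes one output element:

```cuda
#define TILE_WIDTH 16

// A naive CUDA sketch of the forward convolution (illustrative only).
__global__ void conv_forward_kernel(float *y, const float *x, const float *k,
                                    int B, int M, int C, int H, int W, int K) {
    const int H_out = H - K + 1;
    const int W_out = W - K + 1;
    const int tiles_w = (W_out + TILE_WIDTH - 1) / TILE_WIDTH;

    const int b = blockIdx.x;                                        // image in the batch
    const int m = blockIdx.y;                                        // output feature map
    const int h = (blockIdx.z / tiles_w) * TILE_WIDTH + threadIdx.y; // output row
    const int w = (blockIdx.z % tiles_w) * TILE_WIDTH + threadIdx.x; // output column

    if (h < H_out && w < W_out) {
        float acc = 0.0f;
        for (int c = 0; c < C; ++c)
            for (int p = 0; p < K; ++p)
                for (int q = 0; q < K; ++q)
                    acc += x[((b * C + c) * H + h + p) * W + w + q]
                         * k[((m * C + c) * K + p) * K + q];
        y[((b * M + m) * H_out + h) * W_out + w] = acc;
    }
}

// A plausible launch configuration:
//   const int tiles = ((H_out + TILE_WIDTH - 1) / TILE_WIDTH)
//                   * ((W_out + TILE_WIDTH - 1) / TILE_WIDTH);
//   dim3 block(TILE_WIDTH, TILE_WIDTH, 1);
//   dim3 grid(B, M, tiles);
//   conv_forward_kernel<<<grid, block>>>(y, x, k, B, M, C, H, W, K);
```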
First, ensure you are using the correct image in your `rai_build.yml` file:

```yaml
image: illinoisimpact/ece408_mxnet_docker:amd64-gpu-latest-fa19
```
Modify `rai_build.yml` to use nvprof to save some timeline and analysis information, as described in the nvprof section. Use the NVIDIA Visual Profiler to find the execution of your kernel, and show it in your report. The NVVP on EWS section describes how to install NVVP.

Use

```
rai -p <project folder> --queue rai_amd64_ece408 --submit=m3
```

to mark your submission.
`m3.1.py` takes one optional argument: the dataset size. If the correctness for each possible model matches the table below, you can be reasonably confident your implementation is right. The correctness does depend on the dataset size. For example, to check your correctness on the full data size of 10000, you could modify `rai_build.yml` to run

```
python m3.1.py 10000
```

| Model | Number of Images | Correctness |
|---|---|---|
| ece408 | 100 | 0.76 |
| ece408 | 1000 | 0.767 |
| ece408 | 10000 (default) | 0.7653 |
(The final model used for internal evaluation will be different.)
Due November 21 @ 5pm
| Deliverables |
|---|
| Everything from Milestone 3 |
| Implement three GPU optimizations |
| Report: describe each optimization |
| Report: demonstrate nvprof profiling the execution |
| Report: use NVVP to analyze your optimizations |
| Use `rai -p <project folder> --queue rai_amd64_ece408 --submit=m4` to mark your job for grading |
For this milestone, you should attempt at least three GPU optimizations (see the optimizations section). Describe the optimizations in your `report.pdf`.

Use the NVIDIA Visual Profiler and your analysis information to describe the effect that your optimizations had on the performance of your convolution. If possible, try to separate the effect of each optimization in your analysis.

Use

```
rai -p <project folder> --queue rai_amd64_ece408 --submit=m4
```

to submit your project folder.
Due December 19 @ 5pm
| Deliverables |
|---|
| Everything from Milestone 4 |
| Implement final GPU optimizations |
| Report: describe and analyze the optimizations |
| Report: demonstrate nvprof profiling the execution |
| Use `rai -p <project folder> --queue rai_amd64_ece408 --submit=final` to mark your job for grading |
Optimize your GPU convolution (see the optimizations section).

Your implementation must work with `rai -p <project-folder> --queue rai_amd64_ece408 --submit=final`. This means all your source files must be in `ece408_src`, and your implementation must work when they are copied to `src/operator/custom` in the MXNet tree and `make` is invoked on the MXNet tree. This is done in the provided `rai_build.yml`. Likewise, the provided `final.py` is an example of the script that will be used to time your implementation.

All of your code must be executed between `auto start = ...` and `auto end = ...` in `new-inl.h`. The easiest way to ensure this is to keep all of your code in `forward()`, or called by `forward()`, in `new-forward.cuh` or `new-forward.h`. Do not modify any timing-related code.

Use `rai -p <project folder> --queue rai_amd64_ece408 --submit=final` to submit your project folder.
You've been building this final report through all the milestones. Keep the content from the earlier milestones, but be sure to include the following:
- Your team name
- Your team member names
- Your NetIDs
- Your UINs
The final report should include at least the following information for each optimization:
- Optimization Approach and Results
- how you identified the optimization opportunity
- why you thought the approach would be fruitful
- the effect of the optimization: was it fruitful, and why or why not? Use nvprof and NVVP to justify your explanation.
- Any external references used during identification or development of the optimization
- How your team organized and divided up this work.
- References (as needed)
- (Optional) Suggestions for Improving Next Year
The overall project score will be computed as follows:
- Milestone 1 ( 5% )
- Milestone 2 ( 10% )
- Milestone 3 ( 10% )
- Milestone 4 ( 30% )
  - Optimization 1 ( 10% )
  - Optimization 2 ( 10% )
  - Optimization 3 ( 10% )
- Final Optimizations ( 30% )
  - Optimization 4 ( 10% )
  - Optimization 5 ( 10% )
  - Optimization 6 ( 10% )
  - Additional optimizations / detailed insights ( up to +10% extra! )
- Performance Ranking ( 10% )
- Report Style ( 5% )
  - Clear, concise writing, good layout, and good organization will be rewarded.
Each optimization will be graded as follows:
- Explanation of Performance Impact ( 40% )
- Correctness ( 60% )
The Performance Ranking will be graded as follows:
- The median performance will be determined (how well the class did as a whole)
- Your performance will be converted to a number of standard deviations above/below that median (how well you did compared to the class).
- That value will be linearly mapped into the space of 0-10 to determine the ranking grade.
The ranking is determined by the total run time of the two layer invocations. If your implementation is not correct, you will get a 0 for this component of the grade. The `rai ranking` command is not the final word: the staff will re-run all final submissions multiple times and choose the fastest result as your time.

The ranking is determined solely by the values printed by `Op Time:` during your run. That Op Time is computed by wrapping the MXNet op that you implement in a timer.
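Conceptually, the timing wrapper looks something like the sketch below. The actual code lives in `new-inl.h` between the `auto start = ...` and `auto end = ...` lines mentioned above; this is an illustration, not the exact source:

```cpp
#include <chrono>
#include <iostream>

// Illustrative only: surround the op call with timestamps and print the
// elapsed seconds, as the "Op Time:" wrapper in new-inl.h does.
template <typename Op>
void timed_op(Op &&op) {
    auto start = std::chrono::high_resolution_clock::now();
    op(); // the forward pass being measured
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Op Time: " << elapsed.count() << std::endl;
}
```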
We are going to suggest a set of possible optimizations for you to attempt.
- Unroll + shared-memory Matrix multiply
- Shared Memory convolution
- Kernel fusion for unrolling and matrix-multiplication
- Weight matrix (kernel values) in constant memory (a sketch of this one appears after this list)
- Tuning with `restrict` and loop unrolling (counts as one optimization only if you do both)
- An advanced matrix multiplication algorithm (register-tiled, for example)
- Sweeping various parameters to find best values (block sizes, amount of thread coarsening)
- Exploiting parallelism in input images, input channels, and output channels.
- Multiple kernel implementations for different layer sizes
- Input channel reduction: tree
- Input channel reduction: atomics
- ...
Other optimizations that do not fit in here may also be considered as optimizations. If in doubt, contact the course staff.
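As one illustration, here is a hedged sketch of the constant-memory optimization from the list above. The size bound and names are assumptions; the filter must fit within the 64KB constant memory budget:

```cuda
// Sketch: stage the (small) filter weights in constant memory so reads are
// broadcast through the constant cache. MAX_FILTER_ELEMS is an assumed
// bound (16384 floats = 64KB); the real M, C, K come from the layer.
#define MAX_FILTER_ELEMS 16384
__constant__ float c_filter[MAX_FILTER_ELEMS];

// Host side, before launching the kernel:
//   cudaMemcpyToSymbol(c_filter, k_host, M * C * K * K * sizeof(float));
//
// Device side, inside the convolution kernel, replace global reads of k:
//   acc += x[...] * c_filter[((m * C + c) * K + p) * K + q];
```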
Within MXNet, you can use `MSHADOW_CUDA_CALL(...);` as is done in `new-forward.cuh`. Or, you can define a macro/function similar to the `wbCheck` used in WebGPU.
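A minimal sketch of such an error-checking macro, modeled on `wbCheck` (the name and message format here are illustrative):

```cuda
#include <cstdio>

// Wrap CUDA calls to surface errors immediately,
// e.g. errCheck(cudaMalloc(&ptr, bytes));
#define errCheck(stmt)                                            \
    do {                                                          \
        cudaError_t err = (stmt);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
        }                                                         \
    } while (0)
```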
You can gather detailed GPU profile information with `nvprof` and view that information with `nvvp`.

You can see some simple information like so (as we did in Milestone 1):

```
nvprof <your command here>
```

You can gather a timeline file like the following:

```
nvprof -o timeline.nvprof <your command here>
```

This will generate `timeline.nvprof`.

You can additionally gather some detailed performance metrics:

```
nvprof -o timeline.nvprof <your command here>
nvprof --kernels "::forward:1" --analysis-metrics -o forward1_analysis.nvprof <the same command>
nvprof --kernels "::forward:2" --analysis-metrics -o forward2_analysis.nvprof <the same command>
```

This will generate `timeline.nvprof` and `*analysis.nvprof`. `--analysis-metrics` significantly slows the run time, so you may wish to modify the Python scripts to run on smaller datasets during this profiling.
You will need to follow the link rai prints after the execution to retrieve these files. You can use the NVIDIA Visual Profiler (nvvp) to import those files. You will need to install nvvp on your own machine. It can be downloaded as part of the CUDA SDK.
To import the files:

- File > Import > select nvprof > Next > single process > Next
- The timeline data file should be your `timeline.nvprof`
- The events/metrics data file should be your `*analysis.nvprof`
- Finish
The process will be similar for any machine without an NVIDIA GPU (like your Linux laptop). If you wish to install it on Windows or macOS, the CUDA Toolkit installer may partially fail if you do not have an NVIDIA GPU. The teaching staff doesn't support this, but you may be able to figure it out.
Establish an ssh session with X forwarding:

```
ssh -Y <netid>@linux.ews.illinois.edu
```

Download the CUDA toolkit for CentOS 7 and install it to `~/software/cuda-10.0` (you may choose a different location). This takes a while (1GB+ download and install).

```bash
mkdir -p $HOME/software \
  && wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux -O cuda10.run \
  && chmod +x cuda10.run \
  && ./cuda10.run --silent --toolkit --toolkitpath=$HOME/software/cuda-10.0
```

Free up your EWS space (I'm not sure what the disk quotas are):

```
rm cuda10.run
```

Optional: modify `.bashrc` to add `~/software/cuda-10.0/bin` to your PATH. Or, just run it directly:

```
~/software/cuda-10.0/bin/nvvp &
```
It may be hard to directly debug by inspecting values during the forward pass since the weights are already trained and the input data is from a real dataset.
You can always extract your implementations into a separate set of files, generate your own test data, and modify `rai_build.yml` to build and execute your separate test code instead of the MXNet code while developing. A simple code is provided in `build_example`. You could modify the `build` step of `rai_build.yml` in the following way to compile and run it:
```yaml
commands:
  build:
    - echo "Building arbitrary code"
    - make -C /src/build_example
    - echo "Running compiled code"
    - /src/build_example/main
```
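For instance, a standalone test harness for this kind of development could look like the hedged sketch below. It assumes the `conv_forward_cpu` and `conv_forward_kernel` sketches shown earlier in this document; `build_example` has its own, simpler contents:

```cuda
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical harness: fill tensors with synthetic data, run a reference
// CPU convolution and the GPU kernel sketch, and compare the results.
int main() {
    const int B = 2, M = 4, C = 3, H = 16, W = 16, K = 5;
    const int H_out = H - K + 1, W_out = W - K + 1;
    std::vector<float> x(B * C * H * W), k(M * C * K * K);
    std::vector<float> y_ref(B * M * H_out * W_out), y_gpu(y_ref.size());
    for (size_t i = 0; i < x.size(); ++i) x[i] = (i % 7) * 0.1f;  // synthetic input
    for (size_t i = 0; i < k.size(); ++i) k[i] = (i % 5) * 0.01f; // synthetic weights

    conv_forward_cpu(y_ref.data(), x.data(), k.data(), B, M, C, H, W, K);

    float *dx, *dk, *dy;
    cudaMalloc(&dx, x.size() * sizeof(float));
    cudaMalloc(&dk, k.size() * sizeof(float));
    cudaMalloc(&dy, y_gpu.size() * sizeof(float));
    cudaMemcpy(dx, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dk, k.data(), k.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int tiles = ((H_out + 15) / 16) * ((W_out + 15) / 16);
    dim3 block(16, 16, 1);
    dim3 grid(B, M, tiles);
    conv_forward_kernel<<<grid, block>>>(dy, dx, dk, B, M, C, H, W, K);
    cudaMemcpy(y_gpu.data(), dy, y_gpu.size() * sizeof(float), cudaMemcpyDeviceToHost);

    float max_err = 0.0f; // compare GPU output against the CPU reference
    for (size_t i = 0; i < y_gpu.size(); ++i)
        max_err = fmaxf(max_err, fabsf(y_gpu[i] - y_ref[i]));
    printf("max abs error: %f\n", max_err);
    cudaFree(dx); cudaFree(dk); cudaFree(dy);
    return 0;
}
```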
If you'd like to develop using a local copy of MXNet, you may do so. Keep in mind your project will be evaluated through rai. Your submission must work through rai.
Let's use the following directory structure for these instructions. The directories will be created as you work through the steps.

```
<some root dir>
├── fashion-mnist
├── incubator-mxnet
├── m1.1.py
├── m1.2.py
├── m2.1.py
├── m3.1.py
├── m4.1.py
└── models
```
The MXNet instructions are available here. A short form of them follows for Ubuntu.
```bash
# install mxnet prereqs
sudo apt install -y build-essential git libopenblas-dev liblapack-dev libopencv-dev python-pip python-dev python-setuptools python-numpy

# download MXNet release 1.3.0
git clone --single-branch --depth 1 --branch v1.3.0 --recursive https://github.com/apache/incubator-mxnet

# build MXNet
nice -n20 make -C incubator-mxnet -j`nproc` USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_BLAS=openblas

# install python bindings
pip2 install --user -e incubator-mxnet/python
```

You can always uninstall the Python package with

```
pip2 uninstall mxnet
```
The training dataset is a modified version of the fashion-mnist dataset. The scripts to generate it are written in Python 3.

```bash
# install data-generation prereqs
sudo apt install python3 python3-pip
pip3 install --user numpy scikit-image

mkdir -p fashion-mnist
wget -P fashion-mnist \
  https://github.com/illinois-impact/ece408_mxnet_docker/raw/2019sp/scripts/generate-data.py \
  https://github.com/illinois-impact/ece408_mxnet_docker/raw/2019sp/scripts/reader.py
```
Run the generation script. It will download the fashion-mnist dataset and resize it, which may take a few minutes and consume a few hundred megabytes of disk space.

```bash
chmod +x fashion-mnist/generate-data.py
fashion-mnist/generate-data.py fashion-mnist
```
Download the trained models (for the existing MXNet implementation and your implementation) using

```bash
mkdir -p models \
  && wget -P models \
  https://github.com/illinois-impact/ece408_mxnet_docker/raw/2019sp/models/baseline-0002.params \
  https://github.com/illinois-impact/ece408_mxnet_docker/raw/2019sp/models/baseline-symbol.json \
  https://github.com/illinois-impact/ece408_mxnet_docker/raw/2019sp/models/ece408-002.params \
  https://github.com/illinois-impact/ece408_mxnet_docker/raw/2019sp/models/ece408-symbol.json
```
Download the scripts we use for evaluation (these need to be modified to use the 74x74 input image size):

```bash
wget \
  https://github.com/illinois-impact/ece408_mxnet_docker/raw/2019sp/scripts/m1.1.py \
  https://github.com/illinois-impact/ece408_mxnet_docker/raw/2019sp/scripts/m1.2.py \
  https://github.com/illinois-impact/ece408_mxnet_docker/raw/2019sp/scripts/m2.1.py \
  https://github.com/illinois-impact/ece408_mxnet_docker/raw/2019sp/scripts/m3.1.py \
  https://github.com/illinois-impact/ece408_mxnet_docker/raw/2019sp/scripts/m4.1.py
```
Download the skeleton source files into incubator-mxnet. This is also where you will put the skeleton code from `ece408_src`.

```bash
wget -P incubator-mxnet/src/operator/custom \
  https://github.com/illinois-impact/ece408_mxnet_docker/raw/2019sp/ece408_src/new.cc \
  https://github.com/illinois-impact/ece408_mxnet_docker/raw/2019sp/ece408_src/new.cu \
  https://github.com/illinois-impact/ece408_mxnet_docker/raw/2019sp/ece408_src/new-inl.h
```
Modify the Python forward convolution scripts to point to where you downloaded fashion-mnist:

```python
... load_mnist(path="fashion-mnist", ...)
```

Modify the Python forward convolution scripts to point to where you downloaded the models:

```python
lenet_model = mx.mod.Module.load(prefix='models/baseline' ...
```
Build your modified MXNet:

```bash
cp <your source files> incubator-mxnet/src/operator/custom
make -C incubator-mxnet USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1
```
`new-forward.h` and `new-forward.cuh` contain skeleton implementations for CPU and GPU convolutions. You can complete the project by modifying only these two files. These functions are called from `Forward()` in `new-inl.h`.

The code in `new-inl.h`, `new.cc`, and `new.cu` describes the convolution layer to MXNet. You should not modify these files; they are provided for your curiosity. As of rai 0.2.20, when you use the `--submit` flag, a golden version of these files from here is used.
| File | Function | Description |
|---|---|---|
| `new-forward.h` | `forward()` | Your CPU implementation goes here. |
| `new-forward.cuh` | `forward()` | Your GPU host code goes here. |
| `new-forward.cuh` | `forward_kernel()` | Your GPU kernel implementation goes here. |
| -- | -- | -- |
| `new-inl.h` | `InferShape()` | Computes the shape of the output tensor from the input and kernel shapes. |
| `new-inl.h` | `InferType()` | Computes the type of the output tensor based on the inputs. |
| `new-inl.h` | `Forward()` | Defines the operations of the forward pass. Calls our implementation. |
| `new-inl.h` | `Backward()` | Defines the operations of the backward (training) pass. Not used in this project. |
| `new-inl.h` | `struct NewParam` | Defines the arguments passed to the operator in Python. |
| `new.cc` | `CreateOperatorEx()` | Called by MXNet to create the appropriate operator for a CPU or GPU execution. |
| `new.cc` | `CreateOp<cpu>()` | Creates the CPU operator. |
| `new.cu` | `CreateOp<gpu>()` | Creates the GPU operator when CUDA is enabled. |
The `x`, `y`, and `k` tensors constructed in `new-inl.h`/`Forward()` have the following data layout:

| Tensor | Description | Data Layout |
|---|---|---|
| `x` | Input data | batch size * input channels * y * x |
| `y` | Output data | batch size * output channels * y * x |
| `k` | Kernel weights | output channels * input channels * y * x |

You can see this being constructed in `new-inl.h`/`InferShape()`.
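If you flatten these tensors into 1D arrays, index macros like the following sketch can help keep the layout straight. The macro names are illustrative, not part of the skeleton:

```cpp
// Hypothetical helpers for the layouts above, assuming flattened arrays and
// that M, C, H, W, K, H_out, and W_out are in scope.
#define y4d(b, m, h, w) y[((b) * M + (m)) * H_out * W_out + (h) * W_out + (w)]
#define x4d(b, c, h, w) x[((b) * C + (c)) * H * W + (h) * W + (w)]
#define k4d(m, c, p, q) k[((m) * C + (c)) * K * K + (p) * K + (q)]
```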
The Docker containers that we use to run your code run on CUDA 10.0. To view the nvprof results, you need to install the CUDA toolkit locally. You can download it from https://developer.nvidia.com/cuda-downloads and follow the installation instructions.

If you don't have an NVIDIA GPU, don't install the driver; just install the CUDA toolkit, and it should work smoothly. If you are stuck, please visit TA office hours. We might update the CUDA toolkit version inside the Docker container; we will let you know if we do.
NCSA/UIUC © 2018 Carl Pearson
Modified in Fall 2018 by Vikram
Last modified by Rui Lan and Zhichun Wan