Welcome to the Center for High Performance Computing (CHPC)'s Student Cluster Competition (SCC) - Team Selection Round. This round requires each team to build a prototype multi-node compute cluster within the National Integrated Cyber Infrastructure Systems (NICIS) virtual compute cloud (described below).
The goal of this document is to introduce you to the competition platform and familiarise you with some Linux and systems administration concepts. This competition provides you with a fixed set of virtual resources, which you will use to initialize a set of virtual machine instances running the Linux flavor of your choice.
The CHPC invites applications from suitably qualified candidates to enter the CHPC Student Cluster Competition. The CHPC Student Cluster Competition gives undergraduate students at South African universities exposure to the High Performance Computing (HPC) industry. The winning team will be entered into the ISC Student Cluster Competition hosted at the 2025 International Supercomputing Conference held in Hamburg, Germany.
You will be accessing all of the course work and material through this GitHub repository, which you and your team must check regularly to receive updates.
You are strongly encouraged to get help and even assist others by Opening and Participating in Discussions.
Tip
Active participation in the student discussions is an easy way to separate yourselves from the rest of the competition and make it easy for the instructors to notice you!
Every day will comprise four lectures in the morning, with tutorials taking place in the afternoon. A PDF version of the timetable is available for you to download.
Teams will be evaluated according to the following breakdown, with your progress in the tutorials and your final presentations carrying the most weight.
Component | Weight |
---|---|
Technical Knowledge Assessment | 0.1 |
Tutorials | 0.4 |
Cluster Design Assignment (Part 1) | 0.1 |
Cluster Design Presentation | 0.4 |
The role of mentors, instructors and volunteers is to provide leadership and guidance to the student competitors participating in this year's Center for High Performance Computing (CHPC) 2024 Student Cluster Competition.
In preparing your teams for the competition, your main goal is to teach and impart knowledge to the student participants in such a way that they are empowered and enabled to tackle the problems and benchmarking tasks themselves.
Under no circumstances whatsoever may mentors touch any competition hardware belonging to either their team, or the competition hardware of another team. Mentors are encouraged to provide guidance and leadership to their (as well as other) teams.
Any mentor found to be in direct contravention of this rule may cause their team to incur a penalty. Repeated infringements may result in the disqualification of their team.
Below is a table of Linux system commands and utilities that you may find useful when debugging problems you encounter with your clusters. Note that some of these utilities do not ship with the base installation of some Linux flavors, in which case you will need to install the associated packages before using them.
Command | Description |
---|---|
ssh | Used for logging into a remote machine and for executing commands on the remote machine. |
scp | SCP copies files between hosts on a network. It uses ssh for data transfer, and uses the same authentication and provides the same security as ssh. |
wget / curl | Utilities for non-interactive download of files from the Web. They support HTTP, HTTPS, and FTP protocols. |
top / htop / btop | Provides a dynamic real-time view of a running system. It can display system summary information as well as a list of processes or threads. |
screen / tmux | Full-screen window manager that multiplexes a physical terminal between several processes (typically interactive shells). |
ip a | Displays IP addresses and network interface properties. |
dmesg | Prints the message buffer of the kernel. The output of this command typically contains the messages produced by the device drivers |
watch | Execute a program periodically, showing output fullscreen. |
df -h | Report file system disk space usage. |
ping | Used to verify that a device can communicate with another device on a network. |
lynx | Command-line based web browser (more useful than you think) |
ctrl+alt+[F1...F6] | Open another shell session (multiple ‘desktops’) |
ctrl+z | Suspend the foreground command (resume it in the background with ‘bg’, or in the foreground with ‘fg’) |
du -h | Summarize disk usage of each FILE, recursively for directories. |
lscpu | Command line utility that provides system CPU related information. |
lstopo | View the topology of a Linux system. |
inxi | Lists information related to your system's sensors, partitions, drives, networking, audio, graphics, CPU, system, etc. |
hwinfo | Hardware probing utility that provides detailed info about various components. |
lshw | Hardware probing utility that provides detailed info about various components. |
/proc | Virtual file system that acts as the information and control center of the kernel, providing a communications channel between kernel space and user space. Many of the preceding commands query information provided by /proc, e.g. `cat /proc/cpuinfo`. |
uname | Prints information about your operating system kernel, such as its name, release and the hardware architecture (`uname -a`). For the distribution name and version, see `/etc/os-release`. |
lsblk | Provides information about block devices (disks, hard drives, flash drives, etc) connected to your system and their partitioning schemes. |
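When something misbehaves, it helps to grab the same handful of facts from every node at once. The sketch below is a minimal, hypothetical helper (the names `run_if_present` and `node_snapshot` are illustrative, not part of the tutorials) that assumes a Bash shell. It runs a few of the commands from the table and reports any tool that is not installed, instead of aborting.

```bash
#!/usr/bin/env bash
# Minimal sketch: print a quick health snapshot of a node using a few of the
# utilities from the table above. Missing tools are reported, not fatal.

run_if_present() {
    local tool="$1"
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        echo "=== $tool $* ==="
        # Report a non-zero exit status instead of stopping the snapshot.
        "$tool" "$@" || echo "($tool exited with status $?)"
    else
        echo "=== $tool: not installed (install the package that provides it) ==="
    fi
}

node_snapshot() {
    run_if_present uname -a   # kernel name, release and architecture
    run_if_present lscpu      # CPU model, sockets, cores and threads
    run_if_present df -h      # file system disk space usage
    run_if_present lsblk      # block devices and partitions
    run_if_present ip a       # network interfaces and addresses
}

node_snapshot
```

To collect the same snapshot from a compute node without copying the script across, you could save it as `snapshot.sh` (a hypothetical file name) and run `ssh compute01 'bash -s' < snapshot.sh`, where `compute01` stands in for your own node's host name.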
You will need to submit the following for scoring and evaluation by the judges:
- Cluster Design Assignment (Part 1) [10 %]
- Cluster Design Assignment (Part 2) [40 %]
  - One PDF Presentation Slide with Team Profiles. This slide must clearly indicate your Team Name and Institution. Below each team member's photograph, indicate their:
    - Name and surname,
    - Degree and year of study.
  - Presentation Slides
  - Short Technical Brief with Cluster Design Specifications
- Technical Knowledge Assessment [10 %]
- Tutorials [40 %]
You are tasked with designing a small cluster, with at least three nodes, to the value of R 400 000.00 (ZAR), and presenting your design to the judging panel. In your design you must specify the hardware and software for an operational cluster and describe how it functions. The design must be based on servers and interconnects from either HPE or Dell, and accessories from NVIDIA, AMD or Intel. You must use the prices you find in the Parts List Spreadsheet.
The primary purpose of your HPC cluster is to run one of the following codes as efficiently as possible:
You are not given a choice regarding the application selection. Your team will be told which application to optimize for on Wednesday. For now, you should investigate the codes above to understand their unique hardware and software requirements. You are required to submit a brief (half page) report on your findings to the competition organizers by 23:00 on Tuesday.
In addition, your choice of design must take into consideration:
- Base Platform (Server),
- Target Processing Unit (CPU / GPU),
- Memory, Networking and Storage Requirements,
- System and Application Dependency Software Requirements,
- Ease of Use (Build, Assembly, Deployment),
- Efficiency, Performance, Power Consumption and Reliability and
- Team Management, Coordination and Planning.
Important
You may submit an additional design that extends your small R 400 000.00 cluster, up to the value of R 1 000 000.00. You may use any of the above links for this exercise, with a Dollar to Rand conversion rate of 1:20 (US$1 = R20). You may use GPUs from either AMD or NVIDIA, CPUs from either AMD or Intel, and either Dell or HPE as a vendor.
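Keeping a running total while you compare parts is simple arithmetic, but it is easy to slip up when converting prices. The sketch below assumes the 1:20 Dollar to Rand rate quoted above and whole-dollar prices (so Bash integer arithmetic stays exact). The function names and the parts list (3 servers at $15 000 and 1 switch at $2 500) are hypothetical placeholders, not figures from the Parts List Spreadsheet; for the R 400 000.00 design you must still use the spreadsheet prices.

```bash
#!/usr/bin/env bash
# Minimal sketch: convert US dollar prices to rand at the 1:20 rate and check
# a parts list against the two budget caps. All prices here are hypothetical.

readonly USD_TO_ZAR=20
readonly BASE_BUDGET_ZAR=400000        # R 400 000.00 small cluster
readonly EXTENDED_BUDGET_ZAR=1000000   # R 1 000 000.00 extended design

usd_to_zar() {
    echo $(( $1 * USD_TO_ZAR ))
}

# Read "quantity unit_price_usd" lines from stdin and print the total in ZAR.
total_zar() {
    local qty price total=0
    while read -r qty price; do
        [ -z "$qty" ] && continue      # skip blank lines
        total=$(( total + qty * $(usd_to_zar "$price") ))
    done
    echo "$total"
}

# Succeeds (exit status 0) if the amount does not exceed the cap.
within_budget() {
    [ "$1" -le "$2" ]
}

# Hypothetical parts list: 3 servers at $15 000 and 1 switch at $2 500.
total=$(printf '3 15000\n1 2500\n' | total_zar)
echo "Total: R $total"                 # 3*15000*20 + 1*2500*20 = 950000
within_budget "$total" "$BASE_BUDGET_ZAR" || echo "Exceeds the R 400 000.00 budget"
within_budget "$total" "$EXTENDED_BUDGET_ZAR" && echo "Fits the R 1 000 000.00 extended budget"
```

With these placeholder prices the script prints `Total: R 950000`, flags that the list exceeds the small-cluster budget, and confirms that it fits the extended budget.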
The 10-minute slide presentation by the whole team must include your design decisions and the features of your cluster, including: cost, hardware, software, configuration and operation. Each member of the team is required to present, even though you will be assessed as a team.
After the presentation, the judging panel will have an opportunity to ask questions of each member of your team. Any member of your team can be questioned about any part of the cluster, so make sure that you are all fully familiar with the design.
Each team must work together to answer the Technical Knowledge Assessment to the best of their ability. Team Captains must email their team's answers to the organizers no later than 23:00 on 13 July. You are required to demonstrate your understanding of the concepts in YOUR OWN WORDS. Keep your answers succinct and to the point: each answer should not exceed 2-3 lines.
You will be evaluated on your overall progress in the tutorials. Below you will find an overview, glossary and high-level breakdown of the tutorials. You must progress through four tutorials, which will be released daily. Your overall progress through the tutorials forms a large component of your score. By the end of the week you will have covered a considerable amount of content; use the links provided if you need to refer back to a specific section and cannot remember where it is.
Tutorial 1 deals with introducing concepts to users and getting them started with using the virtual lab, standing up the first virtual machine instance and connecting to it remotely. The content is as follows:
- Checklist
- Network Primer
- Launching your First OpenStack Virtual Machine Instance
- Accessing the NICIS Cloud
- Verify your Team's Project Workspace and Available Resources
- Generating SSH Keys
- Launch a New Instance
- Linux Flavors and Distributions
- OpenStack Instance Flavors
- Networks, Ports, Services and Security Groups
- Key Pair
- Verify that your Instance was Successfully Deployed and Launched
- Associating an Externally Accessible IP Address
- Success State, Resource Management and Troubleshooting
- Introduction to Basic Linux Administration
- Linux Binaries, Libraries and Package Management
- Install, Compile and Run High Performance LinPACK (HPL) Benchmark
Tutorial 2 will demonstrate how to configure and stand up a compute node, and access it using a transparently created, port-forwarding SSH tunnel between your workstation and your head node. You will then install a number of critical services across your cluster.
- Checklist
- Spinning Up a Compute Node on Sebowa (OpenStack)
- Accessing Your Compute Node Using the `ProxyJump` Directive
- Understanding the Roles of the Head Node and Compute Node
- Manipulating Files and Directories
- Verifying Networking Setup
- Configuring a Simple Stateful Firewall Using nftables
- Network Time Protocol
- Network File System
- Generating an SSH Key for your NFS `/home`
- User Account Management
- Ansible User Declaration
- WireGuard VPN Cluster Access
- ZeroTier
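The `ProxyJump` access pattern from Tutorial 2 boils down to a short entry in your workstation's `~/.ssh/config`. The sketch below appends such an entry so that `ssh compute01` hops through the head node automatically. It is a minimal sketch under stated assumptions: the function name `add_proxyjump_host`, the alias `compute01`, the private address `10.0.0.5`, the user `rocky` and the jump host `headnode` are all hypothetical, and `headnode` is assumed to already be a `Host` entry pointing at your head node's externally accessible IP.

```bash
#!/usr/bin/env bash
# Minimal sketch: append an ~/.ssh/config entry that reaches a compute node on
# the private network by jumping through the head node (OpenSSH ProxyJump).
# All names and addresses passed in are hypothetical placeholders.

add_proxyjump_host() {
    local alias="$1" private_ip="$2" user="$3" jump_host="$4"
    local config="${SSH_CONFIG:-$HOME/.ssh/config}"   # override for testing
    mkdir -p "$(dirname "$config")"
    cat >> "$config" <<EOF

Host $alias
    HostName $private_ip
    User $user
    ProxyJump $jump_host
EOF
    chmod 600 "$config"   # ssh refuses config files writable by others
}
```

After running `add_proxyjump_host compute01 10.0.0.5 rocky headnode`, a plain `ssh compute01` connects via the head node. For a one-off connection without editing the config, OpenSSH's `-J` flag does the same: `ssh -J rocky@<head-node-ip> rocky@10.0.0.5`.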
Tutorial 3 will demonstrate how to configure, build, compile and install a variety of system software and applications, using a number of different build tools. Finally, you will learn how to run applications across your cluster.
- Checklist
- Managing Your Environment
- Install Lmod
- Running the High Performance LINPACK (HPL) Benchmark on Your Compute Node
- Building and Compiling OpenBLAS and OpenMPI Libraries from Source
- Intel oneAPI Toolkits and Compiler Suite
- LinPACK Theoretical Peak Performance
- Spinning Up a Second Compute Node Using a Snapshot
- HPC Challenge
- Application Benchmarks and System Evaluation
- GROMACS (ADH Cubic)
- LAMMPS (Lennard-Jones)
- [Qiskit (Quantum Volume)](tutorial3/README.md#qiskit-quantum-volume)
Tutorial 4 demonstrates how to configure Docker containers to deploy a monitoring stack, comprising a metrics database service, an exporting / scraping service and a metrics visualization service. You will then learn the very basics of how to visualize and interpret data, followed by how to automate the deployment of your Sebowa OpenStack infrastructure. Lastly, you'll deploy a scheduler and submit a job to it.
- Checklist
- Cluster Monitoring
- Configuring and Connecting to your Remote JupyterLab Server
- Automating the Deployment of your OpenStack Instances Using Terraform
- Continuous Integration Using CircleCI
- Slurm Scheduler and Workload Manager
- GROMACS Application Benchmark
In this section you will find links to all of the livestreams of the lectures (Teams Meetings) and the subsequent recordings for you to refer back to.
- Welcome, Introduction and Getting Started
- HPC Hardware, HPC Networking and Systems Administration
- Benchmarking, Compilation and Parallel Computing
- Administration and Application Visualization
  - Cluster Admin, Ansible & Containers
  - Monitoring
  - Schedulers
  - Data Visualization & Jupyter Lab
- Career Guidance
Important
While we value your feedback, the following sections are primarily targeted at contributors to the project. As a student participating in the competition, do NOT spend your time working through any of the material below. However, we would love to have your contributions to the project after the competition.
You are strongly encouraged to contribute and improve the project by Opening and Participating in Discussions, Raising, Addressing and Resolving Issues. The following guide describes How to clone, push, and pull with git (beginners GitHub tutorial).
In order to effectively manage the various workflows and stages of development, testing and deployment, the project comprises three primary branches:
- `main`: Stable and production-ready deployment branch of the project.
- `stag`: Staging branch which mirrors production and is used for integration testing of new features.
- `dev`: Development branch for incorporating new features and bug fixes.
Editing the content directly requires Git, which you can use from a terminal application, from PowerShell with Git for Windows, or from MobaXterm.
- Generate an SSH Key (or use an existing one).
- Add your SSH key to your Git profile.
- `git clone` a local copy of the repository to your personal workspace. You can copy the command from GitHub itself:
  `git clone git@github.com:chpc-tech-eval/chpc24-scc-nmu.git`
- When starting work on a new feature or bug fix, create a feature branch off of the development branch, and regularly pull updates from `dev` to ensure that you remain consistent with any changes to `dev`:
  `git checkout dev` followed by `git pull origin dev`
- Create a new branch to work on, e.g. `git branch tutX/bugfix-or-new-feature` followed by `git checkout tutX/bugfix-or-new-feature`, or simply use the single command `git checkout -b tutX/bugfix-or-new-feature`.
  - Give the branch a sensible name.
  - You are encouraged to push the branch back to the remote so that collaborators can see what you are working on as you make the changes.
- Make the appropriate changes and commit them locally:
  `git add <relative_path_to_changed_file(s)>` followed by `git commit -m "some_message_pertaining_to_changes_made"`
- When you have finished editing your feature, merge any remote changes from `dev` and then `push` your local changes back upstream to the remote repository:
  - `git pull origin dev`: (optional) it is generally good practice to incorporate any changes in `dev` into your code early and often.
  - `git pull origin tutX/bugfix-or-new-feature`: (optional) if you are collaborating on a specific feature with someone, it is important to incorporate their changes early and often.
  - `git push origin tutX/bugfix-or-new-feature`
- Once you are satisfied with your changes, eliminate all merge conflicts by pulling all remote changes and deviations into your local working copy.
  - If you are confident that your feature has not deviated from the remote `dev` branch, use `git pull origin dev` to automatically `fetch` and `merge` remote changes from `dev` into your feature branch.
  - Alternatively, if your branch is old, or depends on / requires changes from the remote, use `git fetch` to fetch remote changes so that you can preview them before merging.
  - Resolve your local conflicts and merge all remote changes with `git merge`.
  - Once all the conflicts have been resolved and you've successfully merged all remote changes, push your branch upstream.
- Create a pull request to the remote `dev` branch on GitHub to incorporate your feature.
  - Or to another branch, if your feature branch was adding functionality to an existing feature branch.
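The `git fetch` route described above (fetch remote changes, preview them, then merge) can be wrapped in two small helpers. This is a minimal sketch, assuming the remote is named `origin` and the integration branch is `dev`; the function names `preview_dev_changes` and `merge_dev_changes` are hypothetical and not part of the project's tooling. Only standard Git commands are used.

```bash
#!/usr/bin/env bash
# Minimal sketch: fetch the latest dev branch, preview what it would bring
# into the current feature branch, then merge it.
# Assumes the remote is named "origin" and the integration branch is "dev".

preview_dev_changes() {
    local remote="${1:-origin}" base="${2:-dev}"
    git fetch "$remote" "$base" || return 1
    echo "Commits on $remote/$base not yet in $(git rev-parse --abbrev-ref HEAD):"
    git log --oneline "HEAD..$remote/$base"
    echo "Files changed on $remote/$base since your branch diverged:"
    git diff --stat "HEAD...$remote/$base"   # three dots: diff from the merge base
}

merge_dev_changes() {
    local remote="${1:-origin}" base="${2:-dev}"
    # If this stops with conflicts: edit the files, 'git add' them, then 'git commit'.
    git merge --no-edit "$remote/$base"
}
```

Run `preview_dev_changes` from your feature branch and, only once the listed commits look sensible, run `merge_dev_changes` followed by `git push origin tutX/bugfix-or-new-feature`. Splitting preview and merge into separate steps is deliberate: it gives you a chance to spot surprising upstream changes before they touch your working copy.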
Use the following guide on GitHub Markdown Syntax Editing.