- Checklist
- Cluster Monitoring
- Configuring and Connecting to your Remote JupyterLab Server
- Automating the Deployment of your OpenStack Instances Using Terraform
- Continuous Integration Using CircleCI
- Slurm Scheduler and Workload Manager
- GROMACS Application Benchmark
This tutorial demonstrates cluster monitoring, data visualization, automated infrastructure-as-code (IaC) deployment and workload scheduling. These components are critical to a typical HPC environment.
Monitoring is a widely used component in system administration (including enterprise datacentres and corporate networks). Monitoring allows administrators to be aware of what is happening on any system being monitored, and is useful for proactively identifying potential issues.
Interpreting and understanding your results and data is vital to making meaningful use of them. You will also automate the provisioning and deployment of your "experimental" change-management compute node. Lastly, a workload scheduler ensures that users' jobs are handled properly, fairly balancing all scheduled jobs against the resources available at any time.
You will also cover data interpretation and visualization for previously run benchmark applications.
In this tutorial you will:
- Setup a monitoring stack using Docker Compose
- Install and setup the pre-requisites
- Create all the files required for configuring the 3 containers to be launched
- The docker-compose.yml file describing the Node-Exporter, Prometheus and Grafana services
- The prometheus.yml file describing the metrics to be scraped for each host involved
- The prometheus-datasource.yaml file describing the Prometheus datasource for Grafana
- Start the services
- Verify that they are running and accessible (locally, and externally)
- Create a dashboard in Grafana
- Login to the Grafana endpoint (via your browser)
- Import the appropriate Node-Exporter dashboard
- Check that the dashboard is working as expected
- Prepare, install and configure remote JupyterLab server
- Connect to JupyterLab and visualize benchmarking results
- Automate the provisioning and deployment of your Sebowa OpenStack infrastructure
- Install the Slurm workload manager across your cluster.
- Submit a test job to run on your cluster through the newly-configured workload manager.
Tip
You're going to be manipulating both your head node and your compute node(s) in this tutorial.
You are strongly advised to make use of a terminal multiplexer, such as tmux, before making a connection to your VMs. Once you're logged into your head node, initiate a tmux session:
```bash
tmux
```
Then split the window into two separate panes with `Ctrl + b %`.
SSH into your compute node on the other pane.
Cluster monitoring is crucial for managing Linux machines. Effective monitoring helps detect and resolve issues promptly, provides insights into resource usage (CPU, memory, disk, network), aids in capacity planning, and ensures infrastructure scales with workload demands. By monitoring system performance and health, administrators can prevent downtime, reduce costs, and improve efficiency.
- Traditional Approach Using `top` or `htop`

  Traditionally, Linux system monitoring involves command-line tools like `top` or `htop`. These tools offer real-time system performance insights, displaying active processes, resource usage, and system load. While invaluable for monitoring individual machines, they lack the ability to aggregate and visualize data across multiple nodes in a cluster, which is essential for comprehensive monitoring in larger environments.
- Using Grafana, Prometheus, and Node Exporter
Modern solutions use Grafana, Prometheus, and Node Exporter for robust and scalable monitoring. Prometheus collects and stores metrics, Node Exporter provides system-level metrics, and Grafana visualizes this data. This combination enables comprehensive cluster monitoring with historical data analysis, alerting capabilities, and customizable visualizations, facilitating better decision-making and faster issue resolution.
- What is Docker and Docker Compose and How We Will Use It
Docker is a platform for creating, deploying, and managing containerized applications. Docker Compose defines and manages multi-container applications using a YAML file. For cluster monitoring on a Rocky Linux head node, we will use Docker and Docker Compose to bundle Grafana, Prometheus, and Node Exporter into deployable containers. This approach simplifies installation and configuration, ensuring all components are up and running quickly and consistently, streamlining the deployment of the monitoring stack.
Note
Whenever the word Input: is mentioned, expect the next line to contain commands that you need to copy and paste into your own terminal.
Whenever the word Output: is mentioned, DON'T copy and paste anything below this word, as this is just the expected output.
The following configuration is for your head node. You will be advised of the steps you need to take to monitor your compute node(s) at the end.
You will need to have `docker`, `containerd` and `docker-compose` installed on all the nodes that you want to eventually monitor, i.e. your head node and compute node(s).
- Prerequisites and dependencies
Refer to the following RHEL Guide
- DNF / YUM
```bash
# The yum-utils package provides the yum-config-manager utility
sudo yum install -y yum-utils

# Add and set up the repository for use
sudo yum-config-manager --add-repo https://download.docker.com/linux/rhel/docker-ce.repo
```
- APT
```bash
# Install required package dependencies
sudo apt install apt-transport-https ca-certificates curl software-properties-common -y

# Add the Docker repository
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
```
- Installation
- DNF / YUM
```bash
# If prompted to accept the GPG key, verify that the fingerprint matches, then accept it
sudo yum install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
- APT
```bash
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io -y
```
- Arch
```bash
sudo pacman -S docker

# Start and enable docker prior to installing containerd and docker-compose
sudo pacman -S containerd docker-compose
```
- Start and Enable Docker:
```bash
sudo systemctl start docker
sudo systemctl enable docker
```
- Install Docker-Compose on Ubuntu
- APT
```bash
sudo curl -L "https://github.com/docker/compose/releases/download/$(curl -s https://api.github.com/repos/docker/compose/releases/latest | grep -Po '"tag_name": "\K.*\d')/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```
- Verify that the Docker Engine installation was successful by running the `hello-world` image

  Download and deploy a test image and run it inside a container. When the container runs, it prints a confirmation message and exits.
```bash
# Check the version of Docker
docker --version

# Download and deploy a test image
sudo docker run hello-world

# Check your version of Docker Compose
docker-compose --version
```
You have now successfully installed and started Docker Engine.
- Create a suitable directory, e.g. `/opt/monitoring_stack`, in which you'll keep a number of important configuration files.
```bash
sudo mkdir /opt/monitoring_stack/
cd /opt/monitoring_stack/
```
- Create and edit your monitoring configuration files
```bash
sudo nano /opt/monitoring_stack/docker-compose.yml
```
- Add the following to the `docker-compose.yml` YAML file
```yaml
version: '3'

services:
  node-exporter:
    image: prom/node-exporter
    ports:
      - "9100:9100"
    restart: always
    networks:
      - monitoring-network

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    restart: always
    volumes:
      - /opt/monitoring_stack/prometheus.yml:/etc/prometheus/prometheus.yml
    networks:
      - monitoring-network

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    restart: always
    environment:
      GF_SECURITY_ADMIN_PASSWORD: <SET_YOUR_GRAFANA_PASSWORD>
    volumes:
      - /opt/monitoring_stack/prometheus-datasource.yaml:/etc/grafana/provisioning/datasources/prometheus-datasource.yaml
    networks:
      - monitoring-network

networks:
  monitoring-network:
    driver: bridge
```
- Create and edit your Prometheus configuration file
```bash
sudo nano /opt/monitoring_stack/prometheus.yml
```
- Add the following to your `prometheus.yml` YAML file
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
```
- Configure your Prometheus Data Sources
```bash
sudo nano /opt/monitoring_stack/prometheus-datasource.yaml
```
- Add the following to your `prometheus-datasource.yaml`
```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
```
Tip
If you've successfully configured nftables, you will need to open the following TCP ports: 3000, 9090 and 9100.
Bring up your monitoring stack and verify that the services have been correctly configured.
- Bring up your monitoring stack
```bash
sudo docker compose up -d
```
- Confirm the status of your Docker Containers
```bash
sudo docker ps
```
- Dump the metrics that are being monitored from your services
```bash
# Prometheus
curl -s localhost:9090/metrics | head

# Node Exporter
curl -s localhost:9100/metrics | head

# Grafana
curl -s localhost:3000 | head
```
Post the output of the above commands as comments to the Discussion on GitHub.
Congratulations on correctly configuring your monitoring services!
SSH port forwarding, also known as SSH tunneling, is a method of creating a secure connection between a local computer and a remote machine through an SSH (Secure Shell) connection. Local port forwarding allows you to forward a port on your local machine to a port on a remote machine. It is commonly used to access services behind a firewall or NAT.
Important
The following is included to demonstrate the concept of TCP Port Forwarding. In the next section, you are:
- Opening a TCP Forwarding Port and listening on Port 3000 on your workstation, i.e. http://localhost:3000
- You are then binding this SOCKET to TCP Port 3000 on your head node.
The following diagram may facilitate the discussion and illustrate the scenario:
[workstation:3000] ---- SSH Forwarding Tunnel ----> [head node:3000] ---- Grafana Service on head node
# Connect to Grafana's (head node) service directly from your workstation
[http://localhost:3000] ---- SSH Forwarding Tunnel ----> [Grafana (head node)]
Make sure that you understand the above concepts, as it will facilitate your understanding of the following considerations:
- If you have successfully configured WireGuard
[workstation:3000] ---- WireGuard VPN ----> [head node:3000] ---- Grafana Service on head node
# Connect to Grafana's (head node) service directly from your workstation
[http://<head node (private wireguard ip)>:3000] ---- WireGuard VPN ----> [Grafana (head node)]
- And / or if you have successfully configured ZeroTier
[workstation:3000] ---- ZeroTier VPN ----> [head node:3000] ---- Grafana Service on head node
# Connect to Grafana's (head node) service directly from your workstation
[http://<head node (private zerotier ip)>:3000] ---- ZeroTier VPN ----> [Grafana (head node)]
Caution
You need to ensure that you have understood the above discussions. This section on port forwarding is included for situations where you do not have `sudo` rights on the machine you are working on and cannot open ports or install applications via `sudo`; in such cases, you can forward ports over SSH.
Take the time now however, to ensure that all of your team members understand that there are a number of methods with which you can access remote services on your head node:
- http://154.114.57.x:3000
- http://localhost:3000
- http://<headnode wireguard ip>:3000
- http://<headnode zerotier ip>:3000
Once you have understood the above considerations, you may proceed to create a TCP Port Forwarding tunnel, connecting a port on your workstation directly to a port on your head node.
- Create SSH Port Forwarding Tunnel on your local workstation

  Open a new terminal and run the tunnel command (replace 154.114.57.x with your unique IP):
```bash
ssh -L 3000:localhost:3000 user@154.114.57.x
```
- From a browser on your workstation, navigate to the Grafana dashboard on your head node
- Login to your Grafana dashboards
```text
username: admin
password: <YOUR_GRAFANA_PASSWORD>
```
- Go to Dashboards
- Click on New then Import
- Input: 1860 and click Load
- Click on source: "Prometheus"
- Click on Import
Congratulations on successfully deploying your monitoring stack and adding Grafana Dashboards to visualize this.
If you've managed to successfully configure your dashboards for your head node, repeat the steps for deploying Node Exporter on your compute node(s).
Note
Should you have any difficulties running the above configuration, use the alternative process below to deploy your monitoring stack. Click on the heading to reveal content.
Installing your monitoring stack from pre-compiled binaries
For this tutorial we will install from pre-compiled binaries. The installation and configuration of Prometheus should be done on your head node.
- Create a Prometheus user without login access; this will be done manually as shown below:
```bash
sudo useradd --no-create-home --shell /sbin/nologin prometheus
```
- Download the latest stable version of Prometheus from the official site using `wget`
```bash
wget https://github.com/prometheus/prometheus/releases/download/v2.33.1/prometheus-2.33.1.linux-amd64.tar.gz
```
- Long-list the directory contents to verify that Prometheus was downloaded
```bash
ll
```
- Extract the downloaded archive and move the Prometheus binaries to the /usr/local/bin directory
```bash
tar -xvzf prometheus-2.33.1.linux-amd64.tar.gz
cd prometheus-2.33.1.linux-amd64
sudo mv prometheus promtool /usr/local/bin/
```
- Move back to the home directory and create directories for Prometheus
```bash
cd ~
sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus
```
- Set the correct ownership for the Prometheus directories
```bash
sudo chown prometheus:prometheus /etc/prometheus/
sudo chown prometheus:prometheus /var/lib/prometheus
```
- Move the configuration files and set the correct permissions
```bash
cd prometheus-2.33.1.linux-amd64
sudo mv consoles/ console_libraries/ prometheus.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus/
```
- Configure Prometheus

  Edit the `/etc/prometheus/prometheus.yml` file to configure your targets (compute node).

  Hint: Add the job configuration for the compute_node in the scrape_configs section of your Prometheus YAML configuration file. Ensure that all necessary configurations for this job are correctly placed within the relevant sections of the YAML file:
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "compute_node"
    static_configs:
      - targets: ["<compute_node_ip>:9100"]
```
- Create a service file to manage Prometheus with `systemctl`; the file can be created with the text editor `nano` (you can use any text editor of your choice)
```bash
sudo nano /etc/systemd/system/prometheus.service
```
```ini
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
```
- Reload the systemd daemon, then start and enable the service
```bash
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
```
- Check that your service is active by checking the status
```bash
sudo systemctl status prometheus
```
[!TIP] If you check the status and find that the service is not running, ensure SELinux or AppArmor is not restricting Prometheus from running. Try disabling SELinux/AppArmor temporarily to see if it resolves the issue:
```bash
sudo setenforce 0
```
Then repeat the previous two steps (reload/enable/start, then check the status).
If the Prometheus service still fails to start properly, run the command
```bash
journalctl -u prometheus -f --no-pager
```
and review the output for errors.
[!IMPORTANT] If you have a firewall running, add a TCP rule for port 9090
Verify that your Prometheus configuration is working by navigating to `http://<headnode_ip>:9090` in your web browser to access the Prometheus web interface. Ensure that the `headnode_ip` is the public-facing IP.
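If you prefer to check from the command line first (for instance while your firewall rules are still in flux), Prometheus exposes simple built-in health endpoints. A minimal sketch, run on the head node itself:
```bash
# Should print a "Healthy" message (exact wording varies by Prometheus version)
curl -s http://localhost:9090/-/healthy

# Confirm that your scrape targets have been picked up
curl -s http://localhost:9090/api/v1/targets | head
```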
Node Exporter is a Prometheus exporter specifically designed for hardware and OS metrics exposed by Unix-like kernels. It collects detailed system metrics such as CPU usage, memory usage, disk I/O, and network statistics. These metrics are exposed via an HTTP endpoint, typically accessible at <node_ip>:9100/metrics
. The primary role of Node Exporter is to provide a source of system-level metrics that Prometheus can scrape and store. This exporter is crucial for gaining insights into the health and performance of individual nodes within a network.
The installation and configuration of Node Exporter will be done on the compute node(s).
- Create a Node Exporter User
```bash
sudo adduser -M -r -s /sbin/nologin node_exporter
```
- Download and install Node Exporter; this is done using `wget` as before
```bash
cd /usr/src/
sudo wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
sudo tar xvf node_exporter-1.6.1.linux-amd64.tar.gz
```
- Next, move the Node Exporter binary to the directory `/usr/local/bin` using the following command
```bash
sudo mv node_exporter-*/node_exporter /usr/local/bin
```
- Create a service file to manage Node Exporter with `systemctl`; the file can be created with the text editor `nano` (you can use any text editor of your choice)
```bash
sudo nano /etc/systemd/system/node_exporter.service
```
```ini
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
```
[!IMPORTANT] If firewalld is enabled and running, add a rule for port 9100
```bash
sudo firewall-cmd --permanent --zone=public --add-port=9100/tcp
sudo firewall-cmd --reload
```
- Reload the systemd daemon, then start and enable the service
```bash
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
```
- Check that your service is active by checking the status
```bash
sudo systemctl status node_exporter
```
In order to verify that Node Exporter is set up correctly, we need to access `<node_ip>:9100/metrics`. Unlike with Prometheus, this cannot be done by simply entering the address in your browser, since the compute node only has a private IP; we need to use an SSH tunnel.
What is SSH Tunneling?
SSH tunneling, also known as SSH port forwarding, is a method of securely forwarding network traffic from one network node to another via an encrypted SSH connection. It allows you to securely transmit data over untrusted networks by encrypting the traffic.
Why Use SSH Tunneling in This Scenario?
In this setup, the compute node has only a private IP and is not directly accessible from the internet. The headnode, however, has both a public IP (accessible from the internet) and a private IP (in the same network as the compute node).
Using SSH tunneling allows us to:
- Access Restricted Nodes: Since the compute node is only reachable from the headnode, we can create an SSH tunnel through the headnode to access the compute node.
- Secure Transmission: The tunnel encrypts the traffic between your local machine and the compute node, ensuring that any data sent through this tunnel is secure.
- Simplify Access: By tunneling the Node Exporter port (9100) from the compute node to your local machine, you can access the metrics as if they were running locally, making it easier to monitor and manage the compute node.
- Set Up SSH Tunnel on Your Local Machine
```bash
ssh -L 9100:compute_node_ip:9100 user@headnode_ip -N
```
- `ssh -L`: This option specifies local port forwarding. It maps a port on your local machine (first 9100) to a port on a remote machine (second 9100 on compute_node_ip) via the SSH server (headnode).
- `compute_node_ip:9100`: The target address and port on the compute node where Node Exporter is running.
- `user@headnode_ip`: The SSH connection details for the headnode.
- `-N`: Tells SSH not to execute any commands, just set up the tunnel.
- By navigating to http://localhost:9100/metrics in your web browser, you can access the Node Exporter metrics from the compute node as if the service were running locally on your machine.
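As a quick sanity check, you can also query the forwarded port from your workstation's command line while the tunnel is open:
```bash
# Run on your workstation while the SSH tunnel is active;
# the first lines of Node Exporter's metrics page should appear
curl -s http://localhost:9100/metrics | head
```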
Grafana is an open-source platform for monitoring and observability, known for its capability to create interactive and customizable dashboards. It integrates seamlessly with various data sources, including Prometheus. Through its user-friendly interface, Grafana allows users to build and execute queries to visualize data effectively. Beyond visualization, Grafana also supports alerting based on the visualized data, enabling users to set up notifications for specific conditions. This makes Grafana a powerful tool for both real-time monitoring and historical analysis of system performance.
Now we go back to the head node for the installation and configuration of Grafana.
- Add the Grafana Repository, by adding the following directives in this file:
```bash
sudo nano /etc/yum.repos.d/grafana.repo
```
```ini
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
exclude=*beta*
```
- Install Grafana
```bash
sudo dnf install grafana -y
```
- Start and Enable Grafana
```bash
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
```
- Check the status of grafana-server
```bash
sudo systemctl status grafana-server
```
[!IMPORTANT] If firewalld is enabled and running, add a rule for port 3000
```bash
sudo firewall-cmd --permanent --zone=public --add-port=3000/tcp
sudo firewall-cmd --reload
```
Project Jupyter provides powerful tools for scientific investigations due to their interactive and flexible nature. Here are some key reasons why they are favored in scientific research.
- Interactive Computing and Immediate Feedback

  Run code snippets and see the results immediately, which helps in quick iterations and testing of hypotheses. Directly plot graphs and visualize data within the notebook, which is crucial for data analysis.
- Documentation and Rich Narrative Text

  Combine code with Markdown text to explain the methodology, document findings, and write detailed notes. Embed images, videos, and LaTeX equations to enhance documentation and understanding.
- Reproducibility

  Share notebooks with others to ensure that they can reproduce the results by running the same code. Use tools like Git to version control the notebooks, ensuring a record of changes and collaborative development.
- Data Analysis and Visualization

  Utilize a wide range of Python libraries such as NumPy, Pandas, Matplotlib, and Seaborn for data manipulation and visualization. Perform exploratory data analysis (EDA) seamlessly with powerful plotting libraries.
Jupyter Notebooks provide a versatile and powerful environment for conducting scientific investigations, facilitating both the analysis and the clear communication of results.
- Start by installing all the prerequisites

  You will have already installed most of these for the Qiskit Benchmark in tutorial 3.
  - DNF / YUM
```bash
# RHEL, Rocky, Alma, CentOS Stream
sudo dnf install python3 python3-pip
```
  - APT
```bash
# Ubuntu
sudo apt install python3 python3-pip
```
  - Pacman
```bash
# Arch
sudo pacman -S python python-pip
```
- Open TCP port 8889 on your nftables firewall, and restart the service. An example rule is sketched below.
```bash
sudo nano /etc/nftables/hn.nft
sudo systemctl restart nftables
```
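The exact rule depends on how your `hn.nft` ruleset was structured in the earlier tutorials; as a sketch, assuming an inet table with an input chain, the line you add would look something like:
```bash
# Inside the input chain of your hn.nft ruleset: allow the JupyterLab port
tcp dport 8889 accept
```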
Tip
There are a number of plotting utilities available in Python. Each with their own advantages and disadvantages. You will be using Plotly in the following exercises.
You will now visualize the results from the table you prepared of Rmax (GFlops/s) scores for different configurations of HPL.
- Create and Activate a New Python Virtual Environment

  Separate your Python projects and ensure that they exist in their own, clean environments:
```bash
python -m venv hplScores
source hplScores/bin/activate
```
- Install Project Jupyter and Plotly plotting utilities and dependencies
```bash
pip install jupyterlab ipywidgets plotly jupyter-dash
```
- Start the JupyterLab server
```bash
jupyter lab --ip 0.0.0.0 --port 8889 --no-browser
```
- `--ip` binds to all interfaces on your head node, including the public-facing address
- `--port` binds to the port that you granted access to in `nftables`
- `--no-browser` does not try to launch a browser directly on your head node
- Carefully copy your `<TOKEN>` from the command line after successfully launching your JupyterLab server.
```text
# Look for a line similar to the one below, and carefully copy your <TOKEN>
http://127.0.0.1:8889/lab?token=<TOKEN>
```
- Open a browser on your workstation and navigate to your JupyterLab server on your head node: `http://<headnode_public_ip>:8889`
- Login to your JupyterLab server using your `<TOKEN>`.
- Create a new Python Notebook and plot your HPL results:
```python
import plotly.express as px

x = ["Head [<threads>]",
     "Compute Repo MPI and BLAS [<threads>]",
     "Compute Compiled MPI and BLAS [<threads>]",
     "Compute Intel oneAPI Toolkits",
     "Two Compute Nodes",
     "etc..."]
y = [<gflops_headnode>, <gflops_compute>, <gflops_compute_compiled_mpi_blas>,
     <gflops_compute_intel_oneapi>, <gflops_two_compute>, <etc..>]

fig = px.bar(x=x, y=y)
fig.show()
```
- Click on the camera icon to download and save your image. Post your results as a comment, replying to this GitHub discussion thread.
You are now going to extend your `qv_experiment` and plot your results, by drawing a graph of "Number of Qubits vs Simulation Time to Solution":
- Create and Activate a New Python Virtual Environment

  Separate your Python projects and ensure that they exist in their own, clean environments:
```bash
python -m venv QiskitAer
source QiskitAer/bin/activate
```
- You may need to install additional dependencies
```bash
pip install matplotlib jupyterlab
```
- Append the following to your `qv_experiment.py` script:
```python
import numpy as np
import matplotlib.pyplot as plt

# number of qubits; for your system, see how much higher than 30 you can go...
num_qubits = np.arange(2, 10)
# QV depth
qv_depth = 5
# For bonus points submit results with up to 20 or even 30 shots
# Note that this will be more demanding on your system
num_shots = 10

# List for storing the output results
result_array = []

# iterate over the number of qubits at fixed QV depth
# (quant_vol is defined earlier in qv_experiment.py)
for i in num_qubits:
    result_array.append(quant_vol(qubits=i, shots=num_shots, depth=qv_depth))
    # for debugging purposes you can optionally print the output
    print(i, result_array[-1])

plt.xlabel('Number of qubits')
plt.ylabel('Time (sec)')
plt.plot(num_qubits, result_array)
plt.title('Quantum Volume Experiment with depth=' + str(qv_depth))
plt.savefig('qv_experiment.png')
```
- Run the benchmark by executing the script you've just written:
```bash
python qv_experiment.py
```
Terraform is a piece of software that allows one to write out their cloud infrastructure and deployments as code (IaC). This allows the deployments of your cloud virtual machine instances to be shared, iterated and automated as needed, and for software development practices to be applied to your infrastructure.
In this section of the tutorial, you will be deploying an additional compute node from your head node using Terraform.
Caution
In the following section, you must request additional resources from the instructors. This additional node will be experimental for testing your changes to your cluster before committing them to your active compute nodes. You will be deleting and reinitializing this instance often. Make sure you understand how to Delete Instance.
You will now prepare, install and initialize Terraform on your head node. You will define and configure a `providers.tf` file to configure OpenStack instances (as Sebowa is an OpenStack-based cloud).
- Use your operating system's package manager to install Terraform

  This could be your workstation or one of your VMs. The machine must be connected to the internet and have access to your OpenStack workspace, i.e. https://sebowa.nicis.ac.za
  - DNF / YUM
```bash
sudo yum update -y

# Install package to manage repository configurations
sudo yum install -y dnf-plugins-core

# Add the HashiCorp Repo
sudo dnf config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo

sudo dnf install -y terraform
```
- APT
```bash
# Update package repository
sudo apt-get update
sudo apt-get install -y gnupg software-properties-common

# Add HashiCorp GPG Keys
wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg

# Add the official HashiCorp Linux Repo
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list

# Update and install Terraform
sudo apt-get update
sudo apt-get install -y terraform
```
- Pacman
```bash
# Arch
sudo pacman -S terraform
```
- Create a Terraform directory, descend into it and edit the `providers.tf` file
```bash
mkdir terraform
cd terraform
vim providers.tf
```
- You must specify a Terraform Provider

  These can vary from MS Azure, AWS, Google, Kubernetes etc... We will be implementing an OpenStack provider as this is what is implemented on the Sebowa cloud platform. Add the following to the `providers.tf` file.
```hcl
terraform {
  required_providers {
    openstack = {
      source  = "terraform-provider-openstack/openstack"
      version = "1.46.0"
    }
  }
}
```
- Initialize Terraform

  From the folder with your provider definition, execute the following command:
```bash
terraform init
```
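Optionally, before moving on, you can sanity-check your configuration using Terraform's built-in subcommands:
```bash
# Check that the provider definition is syntactically valid
terraform validate

# (Optional) normalize the formatting of your .tf files
terraform fmt
```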
Generate and configure the `clouds.yaml` file that will authenticate you against your Sebowa OpenStack workspace, and the `main.tf` file that will define how your infrastructure should be provisioned.
- Generate OpenStack API Credentials

  From your team's Sebowa workspace, navigate to Identity → Application Credentials, and generate a set of OpenStack credentials in order to allow you to access and authenticate against your workspace.
- Download and Copy the `clouds.yaml` File

  Copy the `clouds.yaml` file to the folder where you initialized Terraform. Its contents should be similar to:
```yaml
# This is a clouds.yaml file, which can be used by OpenStack tools as a source
# of configuration on how to connect to a cloud. If this is your only cloud,
# just put this file in ~/.config/openstack/clouds.yaml and tools like
# python-openstackclient will just work with no further config. (You will need
# to add your password to the auth section)
# If you have more than one cloud account, add the cloud entry to the clouds
# section of your existing file and you can refer to them by name with
# OS_CLOUD=openstack or --os-cloud=openstack
clouds:
  openstack:
    auth:
      auth_url: https://sebowa.nicis.ac.za:5000
      application_credential_id: "<YOUR TEAM's APPLICATION CREDENTIAL ID>"
      application_credential_secret: "<YOUR TEAM's APPLICATION CREDENTIAL SECRET>"
    region_name: "RegionOne"
    interface: "public"
    identity_api_version: 3
    auth_type: "v3applicationcredential"
```
- Create `main.tf` Terraform File

  Inside your `terraform` folder, you must define a `main.tf` file. This file is used to identify the provider to be implemented, as well as the compute resource configuration details of the instance we would like to launch. You will need to define your own `main.tf` file, but below is an example of one such definition:
```hcl
provider "openstack" {
  cloud = "openstack"
}

resource "openstack_compute_instance_v2" "terraform-demo-instance" {
  name            = "scc24-arch-cn03"
  image_id        = "33b938c8-6c07-45e3-8f2a-cc8dcb6699de"
  flavor_id       = "4a126f4f-7df6-4f95-b3f3-77dbdd67da34"
  key_pair        = "nlisa at mancave"
  security_groups = ["default", "ssc24_sq"]

  network {
    name = "nlisa-vxlan"
  }
}
```
Note
You must specify your own variables for `name`, `image_id`, `flavor_id`, `key_pair` and `network.name`.
- Generate and Deploy Terraform Plan

  Create a Terraform plan based on the current configuration. This plan will be used to implement changes to your Sebowa OpenStack cloud workspace, and can be reviewed before applying those changes. Generate a plan and write it to disk:
```bash
terraform plan -out ~/terraform/plan
```
- Once you are satisfied with the proposed changes, deploy the Terraform plan:
```bash
terraform apply ~/terraform/plan
```
- Verify New Instance Successfully Created by Terraform

  Finally, confirm that your new instance has been successfully created. On your Sebowa OpenStack workspace, navigate to Project → Compute → Instances.
Tip
To avoid losing your team's progress, it would be a good idea to create a GitHub repo in order for you to commit and push your various scripts and configuration files.
CircleCI is a Continuous Integration and Continuous Delivery platform that can be utilized to implement DevOps practices. It helps teams build, test, and deploy applications quickly and reliably.
In this section of the tutorial you're going to be expanding on the OpenStack instance automation with CircleCI `Workflows` and `Pipelines`. For this tutorial you will be using your GitHub account, which will integrate directly into CircleCI.
You will be integrating GitHub into CircleCI workflows, wherein every time you commit changes to your `deploy_compute_node` GitHub repository, CircleCI will instantiate and trigger Terraform to create a new compute node VM on Sebowa.
- Create GitHub Repository

  If you haven't already done so, sign up for a GitHub Account. Then create an empty private repository with a suitable name, i.e. `deploy_compute_node`:
- Add your team members to the repository to provide them with access:
- If you haven't already done so, add your SSH key to your GitHub account by following the instructions from Steps to follow when editing existing content.
Tip
You will be using your head node to orchestrate and configure your infrastructure. Pay careful attention to ensure that you copy over your head node's public SSH key. Administrating and managing your compute nodes in this manner requires you to think about them as "cattle" and not "pets".
- On your head node, create a folder that is going to be used to initialize the GitHub repository:
```bash
mkdir ~/deploy_compute_node
cd ~/deploy_compute_node
```
- Copy the `providers.tf` and `main.tf` files you had previously generated:
```bash
cp ~/terraform/providers.tf ./
cp ~/terraform/main.tf ./
vim main.tf
```
The `.circleci/config.yml` configuration file is where you define your build, test and deployment process. From your head node, you are going to be `pushing` your Infrastructure as Code to your private GitHub repository. This will then automatically trigger the CircleCI deployment of a Docker container which has been tailored for Terraform operations, with instructions that will deploy your Sebowa OpenStack compute node instance.
- Create and edit `.circleci/config.yml`:
```bash
mkdir .circleci
vim .circleci/config.yml

# Remember that if you are not comfortable using Vim, install and make use of Nano
```
- Copy the following configuration into `.circleci/config.yml`:
```yaml
version: 2.1

jobs:
  deploy:
    docker:
      - image: hashicorp/terraform:latest
    steps:
      - checkout
      - run:
          name: Create clouds.yaml
          command: |
            mkdir -p ~/.config/openstack
            cat > ~/.config/openstack/clouds.yaml <<EOF
            clouds:
              openstack:
                auth:
                  auth_url: https://sebowa.nicis.ac.za:5000
                  application_credential_id: ${application_credential_id}
                  application_credential_secret: ${application_credential_secret}
                region_name: "RegionOne"
                interface: "public"
                identity_api_version: 3
                auth_type: "v3applicationcredential"
            EOF
      - run:
          name: Terraform Init
          command: terraform init
      - run:
          name: Terraform Apply
          command: terraform apply -auto-approve

workflows:
  version: 2
  deploy_workflow:
    jobs:
      - deploy
```
- Version: Specifies the configuration version.
- Jobs: Defines the individual steps in the build process, where we've defined a `deploy` job that runs inside the latest Terraform Docker container from HashiCorp.
- Steps: The steps to execute within the job:
  - `checkout`: Clone and checkout the code from the repository.
  - `run`: Executes a number of shell commands to create the `clouds.yaml` file, then initialize and apply the Terraform configuration.
- Workflows: Defines the workflow(s) that CircleCI will follow, where in this instance there is a single workflow specified, `deploy_workflow`, that runs the `deploy` job.
- Initialize the Git Repository, add the files you've just created and push to GitHub:

  Following the instructions from the previous section where you created a new GitHub repo, execute the following commands from your head node, inside the `deploy_compute_node` folder:
```bash
cd ~/deploy_compute_node
git init
git add .
git commit -m "Initial Commit."

# You may be asked to configure your Name and Email. Follow the instructions on the screen before proceeding.
git branch -M main
git remote add origin git@github.com:<TEAM_NAME>/deploy_compute_node.git
git push -u origin main
```
The new files should now be available on GitHub.
Navigate to CircleCI.com to create an account, link and add a new GitHub project.
- Create a new organization and give it a suitable name
- Once you've logged into your workspace, go to projects and create a new project
- Create a new IaC Project
- If your repository is on GitHub, create a corresponding project
- Pick a project name and a repository to associate it to
- Push the configuration to GitHub to trigger workflow
Important
You're going to need to delete your experimental compute node instance on your Sebowa OpenStack workspace, each time you want to test or run the CircleCI integration. It has been included here for demonstration purposes, so that you may begin to see the power and utility of CI/CD and automation.
Navigate to your Sebowa OpenStack workspace to ensure that the deployment was successful.
Consider how you could streamline this process even further using preconfigured instance snapshots, as well as Ansible after your instances have been deployed.
The Slurm Workload Manager (formerly known as Simple Linux Utility for Resource Management), is a free and open-source job scheduler for Linux, used by many of the world's supercomputers/computer clusters. It allows you to manage the resources of a cluster by deciding how users get access for some duration of time so they can perform work. To find out more, please visit the Slurm Website.
- Make sure the clocks, i.e. chrony daemons, are synchronized across the cluster. You can verify this as shown below.
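One way to confirm this, assuming `chrony` is the time service you configured in the earlier tutorials, is to query each node's synchronization status:
```bash
# Run on each node; "Leap status : Normal" and a small "System time" offset
# indicate the clock is synchronized
chronyc tracking

# List the time sources chrony is using
chronyc sources
```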
- Generate SLURM and MUNGE users on all of your nodes:
  - If you have your Ansible User Module working:
    - Create the users as shown in tutorial 2. Do NOT add them to the sysadmin group.
  - If you do NOT have your Ansible User Module working:
```bash
useradd slurm
```
  - Ensure that users and groups (UIDs and GIDs) are synchronized across the cluster. Read up on the appropriate /etc/passwd and /etc/shadow files.
- Install the MUNGE package. MUNGE is an authentication service that makes sure user credentials are valid and is specifically designed for HPC use.

  First, we will enable the EPEL (Extra Packages for Enterprise Linux) repository for `dnf`, which contains extra software that we require for MUNGE and Slurm:
```bash
sudo dnf install epel-release
```
  Then we can install MUNGE, pulling the development source code from the `crb` "CodeReady Builder" repository:
```bash
sudo dnf config-manager --set-enabled crb
sudo dnf install munge munge-libs munge-devel
```
- Generate a MUNGE key for client authentication:
```bash
sudo /usr/sbin/create-munge-key -r
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 600 /etc/munge/munge.key
```
- Using `scp`, copy the MUNGE key to your compute node to allow it to authenticate:
  - SSH into your compute node and create the directory `/etc/munge`. Then exit back to the head node.
  - Since MUNGE has not yet been installed on your compute node, first transfer the file to a temporary location:
```bash
sudo cp /etc/munge/munge.key /tmp/munge.key && sudo chown user:user /tmp/munge.key
```
    Replace user with the name of the user that you are running these commands as.
  - Copy the file to your compute node:
```bash
scp /tmp/munge.key <compute_node_name_or_ip>:/tmp/munge.key
```
  - Move the file to the correct location:
```bash
ssh <compute_node_name_or_ip> 'sudo mv /tmp/munge.key /etc/munge/munge.key'
```
- Start and enable the `munge` service, for example:
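The MUNGE package ships a standard systemd unit, so on the head node this should just be the usual pair of commands, followed by a quick local sanity check:
```bash
sudo systemctl enable munge
sudo systemctl start munge

# Sanity check: encode and decode a credential locally;
# "STATUS: Success (0)" in the output means MUNGE is working on this node
munge -n | unmunge
```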
Install dependency packages:
sudo dnf install gcc openssl openssl-devel pam-devel numactl numactl-devel hwloc lua readline-devel ncurses-devel man2html libibmad libibumad rpm-build perl-Switch libssh2-devel mariadb-devel perl-ExtUtils-MakeMaker rrdtool-devel lua-devel hwloc-devel
-
Download the 20.11.9 version of the Slurm source code tarball (.tar.bz2) from https://download.schedmd.com/slurm/. Copy the URL for
slurm-20.11.9.tar.bz2
from your browser and use thewget
command to easily download files directly to your VM. -
Environment variables are a convenient way to store a name and value for easier recovery when they're needed. Export the version of the tarball you downloaded to the environment variable VERSION. This will make installation easier as you will see how we reference the environment variable instead of typing out the version number at every instance.
export VERSION=20.11.9
-
Build RPM packages for Slurm for installation
sudo rpmbuild -ta slurm-$VERSION.tar.bz2
This should successfully generate Slurm RPMs in the directory that you invoked the
rpmbuild
command from. -
Copy these RPMs to your compute node to install later, using
scp
. -
Install Slurm server
sudo dnf localinstall ~/rpmbuild/RPMS/x86_64/slurm-$VERSION*.rpm \ ~/rpmbuild/RPMS/x86_64/slurm-devel-$VERSION*.rpm \ ~/rpmbuild/RPMS/x86_64/slurm-example-configs-$VERSION*.rpm \ ~/rpmbuild/RPMS/x86_64/slurm-perlapi-$VERSION*.rpm \ ~/rpmbuild/RPMS/x86_64/slurm-slurmctld-$VERSION*.rpm
- Setup Slurm server
```bash
sudo cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
```
  Edit this file (`/etc/slurm/slurm.conf`) and set appropriate values for:
```text
ClusterName=     # Name of your cluster (whatever you want)
ControlMachine=  # DNS name of the head node
```
  Populate the nodes and partitions at the bottom with the following two lines:
```text
NodeName=<computenode> Sockets=<num_sockets> CoresPerSocket=<num_cpu_cores> \
    ThreadsPerCore=<num_threads_per_core> State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```
  To check how many cores your compute node has, run `lscpu` on the compute node. You will get output including `CPU(s)`, `Thread(s) per core`, `Core(s) per socket` and more that will help you determine what to use for the Slurm configuration.

  Hint: if you overspec your compute resources in the definition file then Slurm will not be able to use the nodes.
- Create Necessary Directories and Set Permissions:
```bash
sudo mkdir -p /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
sudo chown -R slurm:slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
```
- Start and enable the `slurmctld` service on the head node, for example:
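As with MUNGE, this should just be the standard systemd commands:
```bash
sudo systemctl enable slurmctld
sudo systemctl start slurmctld

# Confirm the controller came up cleanly
sudo systemctl status slurmctld
```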
- Setup MUNGE:
```bash
sudo dnf install munge munge-libs
sudo scp /etc/munge/munge.key <compute_node_name_or_ip>:/etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
```
- Install Slurm Client
```bash
sudo dnf localinstall ~/rpmbuild/RPMS/x86_64/slurm-$VERSION*.rpm \
    ~/rpmbuild/RPMS/x86_64/slurm-slurmd-$VERSION*.rpm \
    ~/rpmbuild/RPMS/x86_64/slurm-pam_slurm-$VERSION*.rpm
```
- Copy `/etc/slurm/slurm.conf` from the head node to the compute node.
- Create necessary directories:
```bash
sudo mkdir -p /var/spool/slurm/d
sudo chown slurm:slurm /var/spool/slurm/d
```
- Start and enable the `slurmd` service, for example:
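Again, the package installs a standard systemd unit, so on the compute node this should just be:
```bash
sudo systemctl enable slurmd
sudo systemctl start slurmd

# Confirm the node daemon came up cleanly
sudo systemctl status slurmd
```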
Return to your head node. To demonstrate that your scheduler is working, you can run the following command as your normal user:
```bash
sinfo
```
You should see your compute node in an idle state.
Slurm allows for jobs to be submitted in batch (set-and-forget) or interactive (real-time response to the user) modes. Start an interactive session on your compute node via the scheduler with:
```bash
srun -N 1 --pty bash
```
You should automatically be logged into your compute node. This is done via Slurm. Re-run `sinfo` now and also run the command `squeue`. Here you will see that your compute node is now allocated to this job.
To finish, type `exit` and you'll be placed back on your head node. If you run `squeue` again, you will now see that the list is empty.
To confirm that your node configuration is correct, you can run the following command on the head node:
```bash
sinfo -alN
```
The `S:C:T` column means "sockets, cores, threads" and the numbers for your compute node should match the settings that you made in the `slurm.conf` file.
You will now be extending some of your earlier work from Tutorial 3.
[!NOTE] You will need to work on your own PC or laptop to complete this section, not on your head node nor compute node.
You are able to score bonus points for this tutorial by submitting a visualisation of your adh_cubic benchmark run. Follow the instructions below to accomplish this and upload the visualisation.
Download and install the VMD visualization tool by selecting the correct version for your operating system. For example, for a Windows machine with an Nvidia GPU select the “Windows OpenGL, CUDA” option. You may need to register on the website.
https://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=VMD
Use the `WinSCP` application for Windows, or the `scp` command for Linux, to copy the output file `confout.gro` of the adh_cubic benchmark from your cluster to your PC, as sketched below. Attempting to visualise the larger "1.5M_water" simulation is not necessary and not recommended due to memory limitations of most PCs.
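A minimal sketch of the Linux copy, assuming the benchmark was run in a directory named `adh_cubic` under your home directory on the head node (the user, IP and path are placeholders — substitute your own):
```bash
# Run on your PC or laptop, not on the cluster
scp user@154.114.57.x:~/adh_cubic/confout.gro .
```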
- Open VMD, select File then New Molecule..., click Browse... and select your `.gro` file.
- Ensure the filetype was detected as Gromacs GRO, then click Load. In the main VMD window you will see that 134177 particles have been loaded. You should also see the display window has been populated with your simulation particle data.
You can manipulate the data with your mouse cursor: zoom with the mouse wheel or rotate it by dragging with the left mouse button held down. This visualisation presents a naturally occurring protein (blue/green) found in the human body, suspended in a solution of water molecules (red/white).
- From the main VMD window, select Graphics then Representations...
- Under Selected Atoms, replace all with not resname SOL and click apply. You will notice the water solution around your protein has been removed, allowing you to better examine the protein.
- In the same window, select the dropdown Drawing Method and try out a few different options. Select New Cartoon before moving on.
- From the main VMD window, once again select Graphics then Colors. Under Categories, select Display, then Background, followed by 8 white.
- Finally, you are ready to render a snapshot of your visualisation. From the main window, select File then Render..., ensure Snapshot... is selected and enter an appropriate filename. Click Start Rendering.
Simulations like this are used to develop and prototype experimental pharmaceutical drug designs. By visualising the output, researchers are able to better interpret simulation results.
[!TIP]
Copy the resulting `.bmp` file(s) from your cluster to your local computer or laptop and demonstrate this to your instructors for bonus points.
Caution
This is a large benchmark and can possibly take some time. Complete the next sections and come back to this if you feel as though your time is limited.
Pre-process the input data using the `grompp` command:
```bash
gmx_mpi grompp -f pme_verlet.mdp -c out.gro -p topol.top -o md_0_1.tpr
```
Using a batch script similar to the one above (a sketch is also provided below), run the benchmark. You may modify the mpirun command to optimise performance (significantly), but in order to produce a valid result the simulation must run for 5,000 steps, quoted in the output as:
"5000 steps, 10.0 ps."
Note
Please be ready to present the `gromacs_log` files for the 1.5M_water benchmark to the instructors.