- Checklist
- Cluster Monitoring
- Configuring and Connecting to your Remote JupyterLab Server
- Automating the Deployment of your OpenStack Instances Using Terraform
- Continuous Integration Using CircleCI
- Slurm Scheduler and Workload Manager
- GROMACS Application Benchmark
This tutorial demonstrates cluster monitoring, data visualization, automated infrastructure-as-code (IaC) deployment and workload scheduling. These components are critical to a typical HPC environment.
Monitoring is a widely used component in system administration (including enterprise datacentres and corporate networks). Monitoring allows administrators to be aware of what is happening on any system being monitored, and is useful for proactively identifying potential issues.
Interpreting and understanding your results and data is vital to making meaningful use of them. You will also automate the provisioning and deployment of your "experimental" change-management compute node. Lastly, a workload scheduler ensures that users' jobs are handled properly, fairly balancing all scheduled jobs against the resources available at any time.
You will also cover data interpretation and visualization for previously run benchmark applications.
In this tutorial you will:
- Setup a monitoring stack using Docker Compose
- Install and setup the pre-requisites
- Create all the files required for configuring the 3 containers to be launched
- The docker-compose.yml file describing the Node-Exporter, Prometheus and Grafana services
- The prometheus.yml file describing the metrics to be scraped for each host involved
- The prometheus-datasource.yaml file describing the Prometheus datasource for Grafana
- Start the services
- Verify that they are running and accessible (locally, and externally)
- Create a dashboard in Grafana
- Login to the Grafana endpoint (via your browser)
- Import the appropriate Node-Exporter dashboard
- Check that the dashboard is working as expected
- Prepare, install and configure remote JupyterLab server
- Connect to JupyterLab and visualize benchmarking results
- Automate the provisioning and deployment of your Sebowa OpenStack infrastructure
- Install the Slurm workload manager across your cluster.
- Submit a test job to run on your cluster through the newly-configured workload manager.
Tip
You're going to be manipulating both your head node and your compute node(s) in this tutorial.
You are strongly advised to make use of a terminal multiplexer, such as tmux, before making a connection to your VMs. Once you're logged into your head node, initiate a tmux session:
```bash
tmux
```
Then split the window into two separate panes with `Ctrl + b %`.
SSH into your compute node on the other pane.
Cluster monitoring is crucial for managing Linux machines. Effective monitoring helps detect and resolve issues promptly, provides insights into resource usage (CPU, memory, disk, network), aids in capacity planning, and ensures infrastructure scales with workload demands. By monitoring system performance and health, administrators can prevent downtime, reduce costs, and improve efficiency.
- Traditional Approach Using `top` or `htop`

  Traditionally, Linux system monitoring involves command-line tools like `top` or `htop`. These tools offer real-time system performance insights, displaying active processes, resource usage, and system load. While invaluable for monitoring individual machines, they lack the ability to aggregate and visualize data across multiple nodes in a cluster, which is essential for comprehensive monitoring in larger environments.
- Using Grafana, Prometheus, and Node Exporter
Modern solutions use Grafana, Prometheus, and Node Exporter for robust and scalable monitoring. Prometheus collects and stores metrics, Node Exporter provides system-level metrics, and Grafana visualizes this data. This combination enables comprehensive cluster monitoring with historical data analysis, alerting capabilities, and customizable visualizations, facilitating better decision-making and faster issue resolution.
- What is Docker and Docker Compose and How We Will Use It
Docker is a platform for creating, deploying, and managing containerized applications. Docker Compose defines and manages multi-container applications using a YAML file. For cluster monitoring on a Rocky Linux head node, we will use Docker and Docker Compose to bundle Grafana, Prometheus, and Node Exporter into deployable containers. This approach simplifies installation and configuration, ensuring all components are up and running quickly and consistently, streamlining the deployment of the monitoring stack.
Note
Whenever the word Input: is mentioned, expect the next line to contain commands that you need to copy and paste into your own terminal.
Whenever the word Output: is mentioned, DON'T copy and paste anything below this word, as this is just the expected output.
The following configuration is for your head node. You will be advised of the steps you need to take to monitor your compute node(s) at the end.
You will need to have `docker`, `containerd` and `docker-compose` installed on all the nodes that you want to eventually monitor, i.e. your head node and compute node(s).
- Prerequisites and dependencies
Refer to the following RHEL Guide
- DNF / YUM
```bash
# The yum-utils package provides the yum-config-manager utility
sudo yum install -y yum-utils

# Add and set up the repository for use
sudo yum-config-manager --add-repo https://download.docker.com/linux/rhel/docker-ce.repo
```
- APT
```bash
# Install required package dependencies
sudo apt install apt-transport-https ca-certificates curl software-properties-common -y

# Add the Docker repository
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
```
- Installation
- DNF / YUM
```bash
# If prompted to accept the GPG key, verify that the fingerprint matches, then accept it
sudo yum install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
- APT
```bash
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io -y
```
- Arch
```bash
sudo pacman -S docker

# Start and enable docker prior to installing containerd and docker-compose
sudo pacman -S containerd docker-compose
```
- Start and Enable Docker:
```bash
sudo systemctl start docker
sudo systemctl enable docker
```
- Install Docker-Compose on Ubuntu
- APT
```bash
sudo curl -L "https://github.com/docker/compose/releases/download/$(curl -s https://api.github.com/repos/docker/compose/releases/latest | grep -Po '"tag_name": "\K.*\d')/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```
- Verify that the Docker Engine installation was successful by running the `hello-world` image

  Download and deploy a test image and run it inside a container. When the container runs, it prints a confirmation message and exits.
```bash
# Check the version of Docker
docker --version

# Download and deploy a test image
sudo docker run hello-world

# Check your version of Docker Compose
docker-compose --version
```
You have now successfully installed and started Docker Engine.
- Create a suitable directory, e.g. `/opt/monitoring_stack`, in which you'll keep a number of important configuration files.
```bash
sudo mkdir /opt/monitoring_stack/
cd /opt/monitoring_stack/
```
- Create and edit your monitoring configuration files
```bash
sudo nano /opt/monitoring_stack/docker-compose.yml
```
- Add the following to the `docker-compose.yml` YAML file
```yaml
version: '3'

services:
  node-exporter:
    image: prom/node-exporter
    ports:
      - "9100:9100"
    restart: always
    networks:
      - monitoring-network

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    restart: always
    volumes:
      - /opt/monitoring_stack/prometheus.yml:/etc/prometheus/prometheus.yml
    networks:
      - monitoring-network

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    restart: always
    environment:
      GF_SECURITY_ADMIN_PASSWORD: <SET_YOUR_GRAFANA_PASSWORD>
    volumes:
      - /opt/monitoring_stack/prometheus-datasource.yaml:/etc/grafana/provisioning/datasources/prometheus-datasource.yaml
    networks:
      - monitoring-network

networks:
  monitoring-network:
    driver: bridge
```
- Create and edit your Prometheus configuration file
```bash
sudo nano /opt/monitoring_stack/prometheus.yml
```
- Add the following to your `prometheus.yml` YAML file
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
```
- Configure your Prometheus Data Sources
```bash
sudo nano /opt/monitoring_stack/prometheus-datasource.yaml
```
- Add the following to your `prometheus-datasource.yaml`
```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
```
Tip
If you've successfully configured nftables, you will need to open the following TCP ports: 3000, 9090 and 9100.
Bring up your monitoring stack and verify that the services have been correctly configured.
- Bring up your monitoring stack
```bash
sudo docker compose up -d
```
- Confirm the status of your Docker Containers
```bash
sudo docker ps
```
- Dump the metrics that are being monitored from your services
```bash
# Prometheus
curl -s localhost:9090/metrics | head

# Node Exporter
curl -s localhost:9100/metrics | head

# Grafana
curl -s localhost:3000 | head
```
Post the output of the above commands as comments to the Discussion on GitHub.
Congratulations on correctly configuring your monitoring services!
SSH port forwarding, also known as SSH tunneling, is a method of creating a secure connection between a local computer and a remote machine through an SSH (Secure Shell) connection. Local port forwarding allows you to forward a port on your local machine to a port on a remote machine. It is commonly used to access services behind a firewall or NAT.
Important
The following is included to demonstrate the concept of TCP Port Forwarding. In the next section, you are:
- Opening a TCP Forwarding Port and listening on Port 3000 on your workstation, i.e. http://localhost:3000
- You are then binding this SOCKET to TCP Port 3000 on your head node.
The following diagram may facilitate the discussion and illustrate the scenario:
[workstation:3000] ---- SSH Forwarding Tunnel ----> [head node:3000] ---- Grafana Service on head node
# Connect to Grafana's (head node) service directly from your workstation
[http://localhost:3000] ---- SSH Forwarding Tunnel ----> [Grafana (head node)]
Make sure that you understand the above concepts, as it will facilitate your understanding of the following considerations:
- If you have successfully configured WireGuard
[workstation:3000] ---- WireGuard VPN ----> [head node:3000] ---- Grafana Service on head node
# Connect to Grafana's (head node) service directly from your workstation
[http://<head node (private wireguard ip)>:3000] ---- WireGuard VPN ----> [Grafana (head node)]
- And / or if you have successfully configured ZeroTier
[workstation:3000] ---- ZeroTier VPN ----> [head node:3000] ---- Grafana Service on head node
# Connect to Grafana's (head node) service directly from your workstation
[http://<head node (private zerotier ip)>:3000] ---- ZeroTier VPN ----> [Grafana (head node)]
Caution
You need to ensure that you have understood the above discussions. This section on port forwarding is included for situations where you do not have `sudo` rights on the machine you are working on and cannot open ports or install applications via `sudo`; in such cases, you can forward ports over SSH.
Take the time now however, to ensure that all of your team members understand that there are a number of methods with which you can access remote services on your head node:
- http://154.114.57.x:3000
- http://localhost:3000
- http://<headnode wireguard ip>:3000
- http://<headnode zerotier ip>:3000
Once you have understood the above considerations, you may proceed to create a TCP Port Forwarding tunnel, connecting a port on your workstation directly to a port on your head node.
- Create SSH Port Forwarding Tunnel on your local workstation

  Open a new terminal and run the tunnel command (replace 154.114.57.x with your unique IP):
```bash
ssh -L 3000:localhost:3000 user@154.114.57.x
```
- From a browser on your workstation, navigate to the Grafana dashboard on your head node
- Login to your Grafana dashboards
```text
username: admin
password: <YOUR_GRAFANA_PASSWORD>
```
- Go to Dashboards
- Click on New then Import
- Input: 1860 and click Load
- Click on source: "Prometheus"
- Click on Import
Congratulations on successfully deploying your monitoring stack and adding Grafana Dashboards to visualize this.
If you've managed to successfully configure your dashboards for your head node, repeat the steps for deploying Node Exporter on your compute node(s).
Note
Should you have any difficulties running the above configuration, use the alternative process below to deploy your monitoring stack. Click on the heading to reveal content.
Installing your monitoring stack from pre-compiled binaries
For this tutorial we will install from pre-compiled binaries. The installation and configuration of Prometheus should be done on your head node.
- Create a Prometheus user without login access; this will be done manually as shown below:
```bash
sudo useradd --no-create-home --shell /sbin/nologin prometheus
```
- Download the latest stable version of Prometheus from the official site using `wget`
```bash
wget https://github.com/prometheus/prometheus/releases/download/v2.33.1/prometheus-2.33.1.linux-amd64.tar.gz
```
- Long-list the directory contents to verify that Prometheus was downloaded
```bash
ll
```
- Extract the downloaded archive and move the Prometheus binaries to the /usr/local/bin directory
```bash
tar -xvzf prometheus-2.33.1.linux-amd64.tar.gz
cd prometheus-2.33.1.linux-amd64
sudo mv prometheus promtool /usr/local/bin/
```
- Move back to the home directory and create directories for Prometheus
```bash
cd ~
sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus
```
- Set the correct ownership for the Prometheus directories
```bash
sudo chown prometheus:prometheus /etc/prometheus/
sudo chown prometheus:prometheus /var/lib/prometheus
```
- Move the configuration files and set the correct permissions
```bash
cd prometheus-2.33.1.linux-amd64
sudo mv consoles/ console_libraries/ prometheus.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus/
```
- Configure Prometheus

  Edit the `/etc/prometheus/prometheus.yml` file to configure your targets (compute node).

  Hint: Add the job configuration for the compute_node in the scrape_configs section of your Prometheus YAML configuration file. Ensure that all necessary configurations for this job are correctly placed within the relevant sections of the YAML file:
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "compute_node"
    static_configs:
      - targets: ["<compute_node_ip>:9100"]
```
- Create a service file to manage Prometheus with `systemctl`; the file can be created with the text editor `nano` (you can use any text editor of your choice)
```bash
sudo nano /etc/systemd/system/prometheus.service
```
```ini
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
```
- Reload the systemd daemon, then start and enable the service
```bash
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
```
- Check that your service is active by checking the status
```bash
sudo systemctl status prometheus
```
[!TIP] If you check the status and find that the service is not running, ensure SELinux or AppArmor is not restricting Prometheus from running. Try disabling SELinux/AppArmor temporarily to see if it resolves the issue:
```bash
sudo setenforce 0
```
Then repeat the previous two steps (reload/enable/start, then check the status).
If the Prometheus service still fails to start properly, run the command
```bash
journalctl -u prometheus -f --no-pager
```
and review the output for errors.
[!IMPORTANT] If you have a firewall running, add a TCP rule for port 9090
Verify that your Prometheus configuration is working by navigating to `http://<headnode_ip>:9090` in your web browser to access the Prometheus web interface. Ensure that the `headnode_ip` is the public-facing IP.
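If you prefer to check from the command line first (for instance while your firewall rules are still in flux), Prometheus exposes simple built-in health endpoints. A minimal sketch, run on the head node itself:
```bash
# Should print a "Healthy" message (exact wording varies by Prometheus version)
curl -s http://localhost:9090/-/healthy

# Confirm that your scrape targets have been picked up
curl -s http://localhost:9090/api/v1/targets | head
```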
Node Exporter is a Prometheus exporter specifically designed for hardware and OS metrics exposed by Unix-like kernels. It collects detailed system metrics such as CPU usage, memory usage, disk I/O, and network statistics. These metrics are exposed via an HTTP endpoint, typically accessible at <node_ip>:9100/metrics
. The primary role of Node Exporter is to provide a source of system-level metrics that Prometheus can scrape and store. This exporter is crucial for gaining insights into the health and performance of individual nodes within a network.
The installation and configuration of Node Exporter will be done on the compute node(s).
- Create a Node Exporter User
```bash
sudo adduser -M -r -s /sbin/nologin node_exporter
```
- Download and install Node Exporter; this is done using `wget` as before
```bash
cd /usr/src/
sudo wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
sudo tar xvf node_exporter-1.6.1.linux-amd64.tar.gz
```
- Next, move the Node Exporter binary to the directory `/usr/local/bin` using the following command
```bash
sudo mv node_exporter-*/node_exporter /usr/local/bin
```
- Create a service file to manage Node Exporter with `systemctl`; the file can be created with the text editor `nano` (you can use any text editor of your choice)
```bash
sudo nano /etc/systemd/system/node_exporter.service
```
```ini
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
```
[!IMPORTANT] If firewalld is enabled and running, add a rule for port 9100
```bash
sudo firewall-cmd --permanent --zone=public --add-port=9100/tcp
sudo firewall-cmd --reload
```
- Reload the systemd daemon, then start and enable the service
```bash
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
```
- Check that your service is active by checking the status
```bash
sudo systemctl status node_exporter
```
In order to verify that Node Exporter is set up correctly, we need to access `<node_ip>:9100/metrics`. Unlike with Prometheus, this cannot be done by simply entering the address in your browser, since the compute node only has a private IP; we need to use an SSH tunnel.
What is SSH Tunneling?
SSH tunneling, also known as SSH port forwarding, is a method of securely forwarding network traffic from one network node to another via an encrypted SSH connection. It allows you to securely transmit data over untrusted networks by encrypting the traffic.
Why Use SSH Tunneling in This Scenario?
In this setup, the compute node has only a private IP and is not directly accessible from the internet. The headnode, however, has both a public IP (accessible from the internet) and a private IP (in the same network as the compute node).
Using SSH tunneling allows us to:
- Access Restricted Nodes: Since the compute node is only reachable from the headnode, we can create an SSH tunnel through the headnode to access the compute node.
- Secure Transmission: The tunnel encrypts the traffic between your local machine and the compute node, ensuring that any data sent through this tunnel is secure.
- Simplify Access: By tunneling the Node Exporter port (9100) from the compute node to your local machine, you can access the metrics as if they were running locally, making it easier to monitor and manage the compute node.
- Set Up SSH Tunnel on Your Local Machine
```bash
ssh -L 9100:compute_node_ip:9100 user@headnode_ip -N
```
- `ssh -L`: This option specifies local port forwarding. It maps a port on your local machine (first 9100) to a port on a remote machine (second 9100 on compute_node_ip) via the SSH server (headnode).
- `compute_node_ip:9100`: The target address and port on the compute node where Node Exporter is running.
- `user@headnode_ip`: The SSH connection details for the headnode.
- `-N`: Tells SSH not to execute any commands, just set up the tunnel.
- By navigating to http://localhost:9100/metrics in your web browser, you can access the Node Exporter metrics from the compute node as if the service were running locally on your machine.
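As a quick sanity check, you can also query the forwarded port from your workstation's command line while the tunnel is open:
```bash
# Run on your workstation while the SSH tunnel is active;
# the first lines of Node Exporter's metrics page should appear
curl -s http://localhost:9100/metrics | head
```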
Grafana is an open-source platform for monitoring and observability, known for its capability to create interactive and customizable dashboards. It integrates seamlessly with various data sources, including Prometheus. Through its user-friendly interface, Grafana allows users to build and execute queries to visualize data effectively. Beyond visualization, Grafana also supports alerting based on the visualized data, enabling users to set up notifications for specific conditions. This makes Grafana a powerful tool for both real-time monitoring and historical analysis of system performance.
Now we go back to the head node for the installation and configuration of Grafana.
- Add the Grafana Repository, by adding the following directives in this file:
```bash
sudo nano /etc/yum.repos.d/grafana.repo
```
```ini
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
exclude=*beta*
```
- Install Grafana
```bash
sudo dnf install grafana -y
```
- Start and Enable Grafana
```bash
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
```
- Check the status of grafana-server
```bash
sudo systemctl status grafana-server
```
[!IMPORTANT] If firewalld is enabled and running, add a rule for port 3000
```bash
sudo firewall-cmd --permanent --zone=public --add-port=3000/tcp
sudo firewall-cmd --reload
```
Project Jupyter provides powerful tools for scientific investigations due to their interactive and flexible nature. Here are some key reasons why they are favored in scientific research.
- Interactive Computing and Immediate Feedback

  Run code snippets and see the results immediately, which helps in quick iterations and testing of hypotheses. Directly plot graphs and visualize data within the notebook, which is crucial for data analysis.
- Documentation and Rich Narrative Text

  Combine code with Markdown text to explain the methodology, document findings, and write detailed notes. Embed images, videos, and LaTeX equations to enhance documentation and understanding.
- Reproducibility

  Share notebooks with others to ensure that they can reproduce the results by running the same code. Use tools like Git to version control the notebooks, ensuring a record of changes and collaborative development.
- Data Analysis and Visualization

  Utilize a wide range of Python libraries such as NumPy, Pandas, Matplotlib, and Seaborn for data manipulation and visualization. Perform exploratory data analysis (EDA) seamlessly with powerful plotting libraries.
Jupyter Notebooks provide a versatile and powerful environment for conducting scientific investigations, facilitating both the analysis and the clear communication of results.
- Start by installing all the prerequisites

  You will have already installed most of these for the Qiskit Benchmark in tutorial 3.
  - DNF / YUM
```bash
# RHEL, Rocky, Alma, CentOS Stream
sudo dnf install python3 python3-pip
```
  - APT
```bash
# Ubuntu
sudo apt install python3 python3-pip
```
  - Pacman
```bash
# Arch
sudo pacman -S python python-pip
```
- Open TCP port 8889 on your nftables firewall, and restart the service. An example rule is sketched below.
```bash
sudo nano /etc/nftables/hn.nft
sudo systemctl restart nftables
```
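The exact rule depends on how your `hn.nft` ruleset was structured in the earlier tutorials; as a sketch, assuming an inet table with an input chain, the line you add would look something like:
```bash
# Inside the input chain of your hn.nft ruleset: allow the JupyterLab port
tcp dport 8889 accept
```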
Tip
There are a number of plotting utilities available in Python. Each with their own advantages and disadvantages. You will be using Plotly in the following exercises.
You will now visualize the results from the table you prepared of Rmax (GFlops/s) scores for different configurations of HPL.
- Create and Activate a New Python Virtual Environment

  Separate your Python projects and ensure that they exist in their own, clean environments:
```bash
python -m venv hplScores
source hplScores/bin/activate
```
- Install Project Jupyter and Plotly plotting utilities and dependencies
```bash
pip install jupyterlab ipywidgets plotly jupyter-dash
```
- Start the JupyterLab server
```bash
jupyter lab --ip 0.0.0.0 --port 8889 --no-browser
```
- `--ip` binds to all interfaces on your head node, including the public-facing address
- `--port` binds to the port that you granted access to in `nftables`
- `--no-browser` does not try to launch a browser directly on your head node
- Carefully copy your `<TOKEN>` from the command line after successfully launching your JupyterLab server.
```text
# Look for a line similar to the one below, and carefully copy your <TOKEN>
http://127.0.0.1:8889/lab?token=<TOKEN>
```
- Open a browser on your workstation and navigate to your JupyterLab server on your head node: `http://<headnode_public_ip>:8889`
- Login to your JupyterLab server using your `<TOKEN>`.
- Create a new Python Notebook and plot your HPL results:
```python
import plotly.express as px

x = ["Head [<threads>]",
     "Compute Repo MPI and BLAS [<threads>]",
     "Compute Compiled MPI and BLAS [<threads>]",
     "Compute Intel oneAPI Toolkits",
     "Two Compute Nodes",
     "etc..."]
y = [<gflops_headnode>, <gflops_compute>, <gflops_compute_compiled_mpi_blas>,
     <gflops_compute_intel_oneapi>, <gflops_two_compute>, <etc..>]

fig = px.bar(x=x, y=y)
fig.show()
```
- Click on the camera icon to download and save your image. Post your results as a comment, replying to this GitHub discussion thread.
You are now going to extend your `qv_experiment` and plot your results, by drawing a graph of "Number of Qubits vs Simulation Time to Solution":
- Create and Activate a New Python Virtual Environment

  Separate your Python projects and ensure that they exist in their own, clean environments:
```bash
python -m venv QiskitAer
source QiskitAer/bin/activate
```
- You may need to install additional dependencies
```bash
pip install matplotlib jupyterlab
```
- Append the following to your `qv_experiment.py` script:
```python
import numpy as np
import matplotlib.pyplot as plt

# number of qubits; for your system, see how much higher than 30 you can go...
num_qubits = np.arange(2, 10)
# QV depth
qv_depth = 5
# For bonus points submit results with up to 20 or even 30 shots
# Note that this will be more demanding on your system
num_shots = 10

# List for storing the output results
result_array = []

# iterate over the number of qubits at fixed QV depth
# (quant_vol is defined earlier in qv_experiment.py)
for i in num_qubits:
    result_array.append(quant_vol(qubits=i, shots=num_shots, depth=qv_depth))
    # for debugging purposes you can optionally print the output
    print(i, result_array[-1])

plt.xlabel('Number of qubits')
plt.ylabel('Time (sec)')
plt.plot(num_qubits, result_array)
plt.title('Quantum Volume Experiment with depth=' + str(qv_depth))
plt.savefig('qv_experiment.png')
```
- Run the benchmark by executing the script you've just written:
```bash
python qv_experiment.py
```
Terraform is a piece of software that allows one to write out their cloud infrastructure and deployments as code (IaC). This allows the deployments of your cloud virtual machine instances to be shared, iterated and automated as needed, and for software development practices to be applied to your infrastructure.
In this section of the tutorial, you will be deploying an additional compute node from your head node using Terraform.
Caution
In the following section, you must request additional resources from the instructors. This additional node will be experimental for testing your changes to your cluster before committing them to your active compute nodes. You will be deleting and reinitializing this instance often. Make sure you understand how to Delete Instance.
You will now prepare, install and initialize Terraform on your head node. You will define and configure a `providers.tf` file to configure OpenStack instances (as Sebowa is an OpenStack-based cloud).
- Use your operating system's package manager to install Terraform

  This could be your workstation or one of your VMs. The machine must be connected to the internet and have access to your OpenStack workspace, i.e. https://sebowa.nicis.ac.za
  - DNF / YUM
```bash
sudo yum update -y

# Install package to manage repository configurations
sudo yum install -y dnf-plugins-core

# Add the HashiCorp Repo
sudo dnf config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo

sudo dnf install -y terraform
```
- APT
```bash
# Update package repository
sudo apt-get update
sudo apt-get install -y gnupg software-properties-common

# Add HashiCorp GPG Keys
wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg

# Add the official HashiCorp Linux Repo
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list

# Update and install Terraform
sudo apt-get update
sudo apt-get install -y terraform
```
- Pacman
```bash
# Arch
sudo pacman -S terraform
```
- Create a Terraform directory, descend into it and edit the `providers.tf` file
```bash
mkdir terraform
cd terraform
vim providers.tf
```
- You must specify a Terraform Provider

  These can vary from MS Azure, AWS, Google, Kubernetes etc... We will be implementing an OpenStack provider as this is what is implemented on the Sebowa cloud platform. Add the following to the `providers.tf` file.
```hcl
terraform {
  required_providers {
    openstack = {
      source  = "terraform-provider-openstack/openstack"
      version = "1.46.0"
    }
  }
}
```
- Initialize Terraform

  From the folder with your provider definition, execute the following command:
```bash
terraform init
```
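Optionally, before moving on, you can sanity-check your configuration using Terraform's built-in subcommands:
```bash
# Check that the provider definition is syntactically valid
terraform validate

# (Optional) normalize the formatting of your .tf files
terraform fmt
```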
Generate and configure the `clouds.yaml` file that will authenticate you against your Sebowa OpenStack workspace, and the `main.tf` file that will define how your infrastructure should be provisioned.
- Generate OpenStack API Credentials

  From your team's Sebowa workspace, navigate to Identity → Application Credentials, and generate a set of OpenStack credentials in order to allow you to access and authenticate against your workspace.
- Download and Copy the `clouds.yaml` File

  Copy the `clouds.yaml` file to the folder where you initialized Terraform. Its contents should be similar to:
```yaml
# This is a clouds.yaml file, which can be used by OpenStack tools as a source
# of configuration on how to connect to a cloud. If this is your only cloud,
# just put this file in ~/.config/openstack/clouds.yaml and tools like
# python-openstackclient will just work with no further config. (You will need
# to add your password to the auth section)
# If you have more than one cloud account, add the cloud entry to the clouds
# section of your existing file and you can refer to them by name with
# OS_CLOUD=openstack or --os-cloud=openstack
clouds:
  openstack:
    auth:
      auth_url: https://sebowa.nicis.ac.za:5000
      application_credential_id: "<YOUR TEAM's APPLICATION CREDENTIAL ID>"
      application_credential_secret: "<YOUR TEAM's APPLICATION CREDENTIAL SECRET>"
    region_name: "RegionOne"
    interface: "public"
    identity_api_version: 3
    auth_type: "v3applicationcredential"
```
- Create `main.tf` Terraform File

  Inside your `terraform` folder, you must define a `main.tf` file. This file is used to identify the provider to be implemented, as well as the compute resource configuration details of the instance we would like to launch. You will need to define your own `main.tf` file, but below is an example of one such definition:
```hcl
provider "openstack" {
  cloud = "openstack"
}

resource "openstack_compute_instance_v2" "terraform-demo-instance" {
  name            = "scc24-arch-cn03"
  image_id        = "33b938c8-6c07-45e3-8f2a-cc8dcb6699de"
  flavor_id       = "4a126f4f-7df6-4f95-b3f3-77dbdd67da34"
  key_pair        = "nlisa at mancave"
  security_groups = ["default", "ssc24_sq"]

  network {
    name = "nlisa-vxlan"
  }
}
```
Note
You must specify your own variables for `name`, `image_id`, `flavor_id`, `key_pair` and `network.name`.
- Generate and Deploy Terraform Plan

  Create a Terraform plan based on the current configuration. This plan will be used to implement changes to your Sebowa OpenStack cloud workspace, and can be reviewed before applying those changes. Generate a plan and write it to disk:
```bash
terraform plan -out ~/terraform/plan
```
- Once you are satisfied with the proposed changes, deploy the Terraform plan:
```bash
terraform apply ~/terraform/plan
```
- Verify New Instance Successfully Created by Terraform

  Finally, confirm that your new instance has been successfully created. On your Sebowa OpenStack workspace, navigate to Project → Compute → Instances.
Tip
To avoid losing your team's progress, it would be a good idea to create a GitHub repo in order for you to commit and push your various scripts and configuration files.
CircleCI is a Continuous Integration and Continuous Delivery platform that can be utilized to implement DevOps practices. It helps teams build, test, and deploy applications quickly and reliably.
In this section of the tutorial you're going to be expanding on the OpenStack instance automation with CircleCI `Workflows` and `Pipelines`. For this tutorial you will be using your GitHub account, which will integrate directly into CircleCI.
You will be integrating GitHub into CircleCI workflows, wherein every time you commit changes to your `deploy_compute_node` GitHub repository, CircleCI will instantiate and trigger Terraform to create a new compute node VM on Sebowa.
- Create GitHub Repository

  If you haven't already done so, sign up for a GitHub Account. Then create an empty private repository with a suitable name, i.e. `deploy_compute_node`:
- Add your team members to the repository to provide them with access:
- If you haven't already done so, add your SSH key to your GitHub account by following the instructions from Steps to follow when editing existing content.
Tip
You will be using your head node to orchestrate and configure your infrastructure. Pay careful attention to ensure that you copy over your head node's public SSH key. Administrating and managing your compute nodes in this manner requires you to think about them as "cattle" and not "pets".
- On your head node, create a folder that is going to be used to initialize the GitHub repository:
```bash
mkdir ~/deploy_compute_node
cd ~/deploy_compute_node
```
- Copy the `providers.tf` and `main.tf` files you had previously generated:
```bash
cp ~/terraform/providers.tf ./
cp ~/terraform/main.tf ./
vim main.tf
```
The `.circleci/config.yml` configuration file is where you define your build, test and deployment process. From your head node, you are going to be `pushing` your Infrastructure as Code to your private GitHub repository. This will then automatically trigger the CircleCI deployment of a Docker container which has been tailored for Terraform operations, with instructions that will deploy your Sebowa OpenStack compute node instance.
- Create and edit `.circleci/config.yml`:
```bash
mkdir .circleci
vim .circleci/config.yml

# Remember that if you are not comfortable using Vim, install and make use of Nano
```
- Copy the following configuration into `.circleci/config.yml`:
```yaml
version: 2.1

jobs:
  deploy:
    docker:
      - image: hashicorp/terraform:latest
    steps:
      - checkout
      - run:
          name: Create clouds.yaml
          command: |
            mkdir -p ~/.config/openstack
            cat > ~/.config/openstack/clouds.yaml <<EOF
            clouds:
              openstack:
                auth:
                  auth_url: https://sebowa.nicis.ac.za:5000
                  application_credential_id: ${application_credential_id}
                  application_credential_secret: ${application_credential_secret}
                region_name: "RegionOne"
                interface: "public"
                identity_api_version: 3
                auth_type: "v3applicationcredential"
            EOF
      - run:
          name: Terraform Init
          command: terraform init
      - run:
          name: Terraform Apply
          command: terraform apply -auto-approve

workflows:
  version: 2
  deploy_workflow:
    jobs:
      - deploy
```
- Version: Specifies the configuration version.
- Jobs: Defines the individual steps in the build process, where we've defined a `deploy` job that runs inside the latest Terraform Docker container from HashiCorp.
- Steps: The steps to execute within the job:
  - `checkout`: Clone and checkout the code from the repository.
  - `run`: Executes a number of shell commands to create the `clouds.yaml` file, then initialize and apply the Terraform configuration.
- Workflows: Defines the workflow(s) that CircleCI will follow, where in this instance there is a single workflow specified, `deploy_workflow`, that runs the `deploy` job.
- Initialize the Git Repository, add the files you've just created and push to GitHub:

  Following the instructions from the previous section where you created a new GitHub repo, execute the following commands from your head node, inside the `deploy_compute_node` folder:
```bash
cd ~/deploy_compute_node
git init
git add .
git commit -m "Initial Commit."

# You may be asked to configure your Name and Email. Follow the instructions on the screen before proceeding.
git branch -M main
git remote add origin git@github.com:<TEAM_NAME>/deploy_compute_node.git
git push -u origin main
```
The new files should now be available on GitHub.
Navigate to CircleCI.com to create an account, link and add a new GitHub project.
- Create a new organization and give it a suitable name
- Once you've logged into your workspace, go to projects and create a new project
- Create a new IaC Project
- If your repository is on GitHub, create a corresponding project
- Pick a project name and a repository to associate it to
- Push the configuration to GitHub to trigger workflow
Important
You're going to need to delete your experimental compute node instance on your Sebowa OpenStack workspace, each time you want to test or run the CircleCI integration. It has been included here for demonstration purposes, so that you may begin to see the power and utility of CI/CD and automation.
Navigate to your Sebowa OpenStack workspace to ensure that the deployment was successful.
Consider how you could streamline this process even further using preconfigured instance snapshots, as well as Ansible after your instances have been deployed.
The Slurm Workload Manager (formerly known as Simple Linux Utility for Resource Management), is a free and open-source job scheduler for Linux, used by many of the world's supercomputers/computer clusters. It allows you to manage the resources of a cluster by deciding how users get access for some duration of time so they can perform work. To find out more, please visit the Slurm Website.
- Make sure the clocks, i.e. chrony daemons, are synchronized across the cluster. You can verify this as shown below.
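One way to confirm this, assuming `chrony` is the time service you configured in the earlier tutorials, is to query each node's synchronization status:
```bash
# Run on each node; "Leap status : Normal" and a small "System time" offset
# indicate the clock is synchronized
chronyc tracking

# List the time sources chrony is using
chronyc sources
```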
- Generate SLURM and MUNGE users on all of your nodes:
  - If you have your Ansible User Module working:
    - Create the users as shown in tutorial 2. Do NOT add them to the sysadmin group.
  - If you do NOT have your Ansible User Module working:
```bash
useradd slurm
```
  - Ensure that users and groups (UIDs and GIDs) are synchronized across the cluster. Read up on the appropriate /etc/passwd and /etc/shadow files.
- Install the MUNGE package. MUNGE is an authentication service that makes sure user credentials are valid and is specifically designed for HPC use.

  First, we will enable the EPEL (Extra Packages for Enterprise Linux) repository for `dnf`, which contains extra software that we require for MUNGE and Slurm:
```bash
sudo dnf install epel-release
```
  Then we can install MUNGE, pulling the development source code from the `crb` "CodeReady Builder" repository:
```bash
sudo dnf config-manager --set-enabled crb
sudo dnf install munge munge-libs munge-devel
```
- Generate a MUNGE key for client authentication:
```bash
sudo /usr/sbin/create-munge-key -r
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 600 /etc/munge/munge.key
```
- Using `scp`, copy the MUNGE key to your compute node to allow it to authenticate:
  - SSH into your compute node and create the directory `/etc/munge`. Then exit back to the head node.
  - Since MUNGE has not yet been installed on your compute node, first transfer the file to a temporary location:
```bash
sudo cp /etc/munge/munge.key /tmp/munge.key && sudo chown user:user /tmp/munge.key
```
    Replace user with the name of the user that you are running these commands as.
  - Copy the file to your compute node:
```bash
scp /tmp/munge.key <compute_node_name_or_ip>:/tmp/munge.key
```
  - Move the file to the correct location:
```bash
ssh <compute_node_name_or_ip> 'sudo mv /tmp/munge.key /etc/munge/munge.key'
```
- Start and enable the `munge` service, for example:
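The MUNGE package ships a standard systemd unit, so on the head node this should just be the usual pair of commands, followed by a quick local sanity check:
```bash
sudo systemctl enable munge
sudo systemctl start munge

# Sanity check: encode and decode a credential locally;
# "STATUS: Success (0)" in the output means MUNGE is working on this node
munge -n | unmunge
```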
Install dependency packages:
sudo dnf install gcc openssl openssl-devel pam-devel numactl numactl-devel hwloc lua readline-devel ncurses-devel man2html libibmad libibumad rpm-build perl-Switch libssh2-devel mariadb-devel perl-ExtUtils-MakeMaker rrdtool-devel lua-devel hwloc-devel
-
Download the 20.11.9 version of the Slurm source code tarball (.tar.bz2) from https://download.schedmd.com/slurm/. Copy the URL for
slurm-20.11.9.tar.bz2
from your browser and use thewget
command to easily download files directly to your VM. -
Environment variables are a convenient way to store a name and value for easier recovery when they're needed. Export the version of the tarball you downloaded to the environment variable VERSION. This will make installation easier as you will see how we reference the environment variable instead of typing out the version number at every instance.
export VERSION=20.11.9
-
Build RPM packages for Slurm for installation
sudo rpmbuild -ta slurm-$VERSION.tar.bz2
This should successfully generate Slurm RPMs in the directory that you invoked the
rpmbuild
command from. -
Copy these RPMs to your compute node to install later, using
scp
. -
Install Slurm server
sudo dnf localinstall ~/rpmbuild/RPMS/x86_64/slurm-$VERSION*.rpm \ ~/rpmbuild/RPMS/x86_64/slurm-devel-$VERSION*.rpm \ ~/rpmbuild/RPMS/x86_64/slurm-example-configs-$VERSION*.rpm \ ~/rpmbuild/RPMS/x86_64/slurm-perlapi-$VERSION*.rpm \ ~/rpmbuild/RPMS/x86_64/slurm-slurmctld-$VERSION*.rpm
- Setup Slurm server
```bash
sudo cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
```
  Edit this file (`/etc/slurm/slurm.conf`) and set appropriate values for:
```text
ClusterName=     # Name of your cluster (whatever you want)
ControlMachine=  # DNS name of the head node
```
  Populate the nodes and partitions at the bottom with the following two lines:
```text
NodeName=<computenode> Sockets=<num_sockets> CoresPerSocket=<num_cpu_cores> \
    ThreadsPerCore=<num_threads_per_core> State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```
  To check how many cores your compute node has, run `lscpu` on the compute node. You will get output including `CPU(s)`, `Thread(s) per core`, `Core(s) per socket` and more that will help you determine what to use for the Slurm configuration.

  Hint: if you overspec your compute resources in the definition file then Slurm will not be able to use the nodes.
- Create Necessary Directories and Set Permissions:
```bash
sudo mkdir -p /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
sudo chown -R slurm:slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
```
- Start and enable the `slurmctld` service on the head node, for example:
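As with MUNGE, this should just be the standard systemd commands:
```bash
sudo systemctl enable slurmctld
sudo systemctl start slurmctld

# Confirm the controller came up cleanly
sudo systemctl status slurmctld
```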
- Setup MUNGE:
```bash
sudo dnf install munge munge-libs
sudo scp /etc/munge/munge.key <compute_node_name_or_ip>:/etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
```
- Install Slurm Client
```bash
sudo dnf localinstall ~/rpmbuild/RPMS/x86_64/slurm-$VERSION*.rpm \
    ~/rpmbuild/RPMS/x86_64/slurm-slurmd-$VERSION*.rpm \
    ~/rpmbuild/RPMS/x86_64/slurm-pam_slurm-$VERSION*.rpm
```
- Copy `/etc/slurm/slurm.conf` from the head node to the compute node.
- Create necessary directories:
```bash
sudo mkdir -p /var/spool/slurm/d
sudo chown slurm:slurm /var/spool/slurm/d
```
- Start and enable the `slurmd` service, for example:
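Again, the package installs a standard systemd unit, so on the compute node this should just be:
```bash
sudo systemctl enable slurmd
sudo systemctl start slurmd

# Confirm the node daemon came up cleanly
sudo systemctl status slurmd
```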
Return to your head node. To demonstrate that your scheduler is working, you can run the following command as your normal user:
```bash
sinfo
```
You should see your compute node in an idle state.
Slurm allows for jobs to be submitted in batch (set-and-forget) or interactive (real-time response to the user) modes. Start an interactive session on your compute node via the scheduler with:
```bash
srun -N 1 --pty bash
```
You should automatically be logged into your compute node. This is done via Slurm. Re-run `sinfo` now and also run the command `squeue`. Here you will see that your compute node is now allocated to this job.
To finish, type `exit` and you'll be placed back on your head node. If you run `squeue` again, you will now see that the list is empty.
To confirm that your node configuration is correct, you can run the following command on the head node:
```bash
sinfo -alN
```
The `S:C:T` column means "sockets, cores, threads" and the numbers for your compute node should match the settings that you made in the `slurm.conf` file.
You will now be extending some of your earlier work from Tutorial 3.
[!NOTE] You will need to work on your own PC or laptop to complete this section, not on your head node nor compute node.
You are able to score bonus points for this tutorial by submitting a visualisation of your adh_cubic benchmark run. Follow the instructions below to accomplish this and upload the visualisation.
Download and install the VMD visualization tool by selecting the correct version for your operating system. For example, for a Windows machine with an Nvidia GPU select the “Windows OpenGL, CUDA” option. You may need to register on the website.
https://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=VMD
Use the `WinSCP` application for Windows, or the `scp` command for Linux, to copy the output file `confout.gro` of the adh_cubic benchmark from your cluster to your PC, as sketched below. Attempting to visualise the larger "1.5M_water" simulation is not necessary and not recommended due to memory limitations of most PCs.
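A minimal sketch of the Linux copy, assuming the benchmark was run in a directory named `adh_cubic` under your home directory on the head node (the user, IP and path are placeholders — substitute your own):
```bash
# Run on your PC or laptop, not on the cluster
scp user@154.114.57.x:~/adh_cubic/confout.gro .
```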
- Open VMD, select File then New Molecule..., click Browse... and select your `.gro` file.
- Ensure the filetype was detected as Gromacs GRO, then click Load. In the main VMD window you will see that 134177 particles have been loaded. You should also see the display window has been populated with your simulation particle data.
You can manipulate the data with your mouse cursor: zoom with the mouse wheel or rotate it by dragging with the left mouse button held down. This visualisation presents a naturally occurring protein (blue/green) found in the human body, suspended in a solution of water molecules (red/white).
- From the main VMD window, select Graphics then Representations...
- Under Selected Atoms, replace all with not resname SOL and click apply. You will notice the water solution around your protein has been removed, allowing you to better examine the protein.
- In the same window, select the dropdown Drawing Method and try out a few different options. Select New Cartoon before moving on.
- From the main VMD window, once again select Graphics then Colors. Under Categories, select Display, then Background, followed by 8 white.
- Finally, you are ready to render a snapshot of your visualisation. From the main window, select File then Render..., ensure Snapshot... is selected and enter an appropriate filename. Click Start Rendering.
Simulations like this are used to develop and prototype experimental pharmaceutical drug designs. By visualising the output, researchers are able to better interpret simulation results.
[!TIP]
Copy the resulting `.bmp` file(s) from your cluster to your local computer or laptop and demonstrate this to your instructors for bonus points.
Caution
This is a large benchmark and can possibly take some time. Complete the next sections and come back to this if you feel as though your time is limited.
Pre-process the input data using the `grompp` command:
```bash
gmx_mpi grompp -f pme_verlet.mdp -c out.gro -p topol.top -o md_0_1.tpr
```
Using a batch script similar to the one above (a sketch is also provided below), run the benchmark. You may modify the mpirun command to optimise performance (significantly), but in order to produce a valid result the simulation must run for 5,000 steps, quoted in the output as:
"5000 steps, 10.0 ps."
Note
Please be ready to present the `gromacs_log` files for the 1.5M_water benchmark to the instructors.