This repo contains an example MPI application that is built into a container image for use in an NNF Container Workflow. The application uses the dynamic storage created by the workflow; the storage path is passed into the hello world application as a command line argument.
The overall process for creating a container workflow is as follows:
- A container image is created from this repo
- An NNF Container Profile is created on the NNF Kubernetes cluster:
    - The container image is specified in the profile along with the command to run
    - A required GFS2 storage is defined in the profile
    - The command references an environment variable that resolves to the storage's mount path
- A Workflow is created and contains two directives:
    - Directive for the GFS2 filesystem (that matches the GFS2 storage name in the profile)
    - Directive for the container (that specifies the profile)
- The Workflow is progressed to the PreRun state, where the container starts
To start, you must have a working container image that includes your application. The NNF Container Profile references this image to instruct the container workflow to run your application. Adding an NNF Container Profile and container image may require elevated privileges, so work with your system administrator to get these onto your system.
The Dockerfile in this repository creates an example image that can be used to drive container workflows.
When building your own image around your user application, ensure it meets the following requirements.
Any container image you build must be hosted in an image registry that is reachable from your cluster; see your cluster administrator for more details.
In this example, we're using the GitHub Container Registry (ghcr.io), so your cluster must have internet access to retrieve the image.
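For example, building the image and pushing it to ghcr.io might look like the following sketch; the organization, image name, and tag here are placeholders, not values from this repo:

```shell
# Build the image from this repository's Dockerfile and push it to a
# registry the cluster can pull from. Replace <org> with your own
# organization; the tag is illustrative.
docker build -t ghcr.io/<org>/nnf-container-example:master .
docker push ghcr.io/<org>/nnf-container-example:master
```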
For MPI applications, the container must include the following:
- open-mpi
- MPI File Utils
- ssh server
- nslookup
The easiest way to do this is to use the NNF MFU (MPI File Utils) image in your Dockerfile:
```dockerfile
FROM ghcr.io/nearnodeflash/nnf-mfu:master
```
Using this image ensures that your image contains the necessary software to run MPI applications across Kubernetes pods that are running on NNF nodes.
Once you have a working container image, it's time to create an NNF Container Profile. The profile is used to define the storages that you expect to use with your application. It also defines how you run your application.
In this example, we are expecting to have one non-optional storage called `DW_JOB_my_storage`. If the storage is persistent storage, then the name must start with `DW_PERSISTENT` rather than `DW_JOB`. Filesystem types are not defined here, but later in the DW directive.
```yaml
---
apiVersion: nnf.cray.hpe.com/v1alpha2
kind: NnfContainerProfile
metadata:
  name: demo
  namespace: nnf-system
data:
  storages:
    - name: DW_JOB_my_storage
      optional: false
```
The name of this storage must be present in the container directive - more on that later.
Next, we define the container specification. For MPI applications, this is done using `mpiSpec`. The `mpiSpec` allows us to define the Launcher and Worker containers.
```yaml
apiVersion: nnf.cray.hpe.com/v1alpha2
kind: NnfContainerProfile
metadata:
  name: demo
  namespace: nnf-system
data:
  storages:
    - name: DW_JOB_my_storage
      optional: false
  mpiSpec:
    mpiReplicaSpecs:
      Launcher:
        template:
          spec:
            containers:
              - name: nnf-container-example
                image: ghcr.io/nearnodeflash/nnf-container-example:master
                command:
                  - mpirun
                  - --tag-output
                  - mpi_hello_world
                  - "$(DW_JOB_my_storage)"
      Worker:
        template:
          spec:
            containers:
              - name: nnf-container-example
                image: ghcr.io/nearnodeflash/nnf-container-example:master
```
Both the `Launcher` and `Worker` must be defined. The main pieces here are to set the images for both and the command for the `Launcher`. Boiled down, these are just Kubernetes PodTemplateSpecs.
The container image tag may need to be changed from "master" to match the tag for your build.
Our `mpi_hello_world` application takes a command line argument for the storage file path. We use the name of the storage defined above in the `storages` object to pass the path into our command:
```yaml
command:
  - mpirun
  - mpi_hello_world
  - "$(DW_JOB_my_storage)"
```

At runtime, Kubernetes expands the `$(DW_JOB_my_storage)` reference using the container's environment, where the NNF software places the storage's mount path.
For the full definition of the `MPIJobSpec` provided by mpi-operator, see the mpi-operator documentation. However, some of these values are overridden by NNF software, and not all configurable options have been tested.
For a full understanding of the other options in an NNF Container Profile, see the nnf-sos samples and examples.
With a container image and an NNF Container Profile, we are now ready to create a Workflow.
The workflow definition will include two DW Directives:
- `#DW jobdw` to create the storage defined in the profile
- `#DW container` to create the containers and map the storage/profile
First, we'll create the storage. We will be using GFS2 for the filesystem:
```
#DW jobdw name=demo-gfs2 type=gfs2 capacity=50GB
```
Then, define the container. Note that the `DW_JOB_my_storage=demo-gfs2` argument matches the storage name in the NNF Container Profile and maps it to the name of the GFS2 filesystem created by the `jobdw` directive above.
```
#DW container name=demo-container profile=demo DW_JOB_my_storage=demo-gfs2
```
Note: The Flux workload manager may take care of all or most of the steps in this section. This is the manual way of doing things. You may want to consult a Flux expert on how to drive a container workflow using Flux.
With a working Kubernetes cluster, the previous examples can be put together and deployed on the system. The files in this repository do exactly that.
Deploy the profile and create the workflow:
```shell
kubectl apply -f nnf-container-example.yaml
```
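To verify that both resources were created, you can list them; `nnfcontainerprofile` is the assumed resource name for the NnfContainerProfile kind on your cluster:

```shell
# The profile lives in the nnf-system namespace; the workflow is created in
# the current namespace (default in this walkthrough).
kubectl get nnfcontainerprofile -n nnf-system
kubectl get workflows
```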
For container directives, compute nodes must be assigned to the workflow. The NNF software traces each compute node back to its local NNF node, and the containers are executed on those NNF nodes. In other words, assigning compute nodes to your container workflow is what instructs the NNF software to select the NNF nodes that run the containers.
For the `jobdw` directive that is included, we must define the servers (i.e. NNF nodes) and the computes. Update the `servers` and `computes` resources assigned to the workflow.
Note: you must change the node names in the `allocation-*.yaml` files to match your system. `allocationCount` must match the number of compute nodes being targeted for that particular NNF node. In this example, `rabbit-node-1` is attached to `compute-node-1` and `compute-node-2`, etc.
```shell
kubectl patch --type merge --patch-file=allocation-servers.yaml servers demo-container-0
kubectl patch --type merge --patch-file=allocation-computes.yaml computes demo-container
```
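The exact `servers` and `computes` schemas can vary between software releases, so it can help to inspect the live resources before writing your patch files; for example:

```shell
# Resource names follow the workflow name, as in the patch commands above.
kubectl get servers demo-container-0 -o yaml
kubectl get computes demo-container -o yaml
```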
At this point, the workflow should be in the `Proposal` state with a `Completed` status:
```
$ kubectl get workflows
NAME             STATE      READY   STATUS      AGE
demo-container   Proposal   true    Completed   12m
```
Progress the workflow to the `Setup` state and wait for `Completed`:
```shell
kubectl patch --type merge workflow demo-container --patch '{"spec": {"desiredState": "Setup"}}'
```

```
$ kubectl get workflows
NAME             STATE   READY   STATUS      AGE
demo-container   Setup   true    Completed   12m
```
Progress to `DataIn`:
```shell
kubectl patch --type merge workflow demo-container --patch '{"spec": {"desiredState": "DataIn"}}'
```
Then `PreRun`:
```shell
kubectl patch --type merge workflow demo-container --patch '{"spec": {"desiredState": "PreRun"}}'
```
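Rather than polling with `kubectl get`, you can watch the workflow until it reaches the desired state; this works for any of the state transitions in this walkthrough:

```shell
# Watch until the workflow reports the PreRun state with Completed status
# (Ctrl-C to stop watching).
kubectl get workflow demo-container -w
```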
The `PreRun` state will start the containers. Once the containers have started successfully, the status will become `Completed`. Your application is now running via the launcher pod, which is instructing `mpirun` to run your application on the worker pods. When the compute nodes were assigned to the workflow in the `Proposal` state (via the `computes` resource), the NNF software traced the compute nodes to their local NNF nodes. In this case, it means two NNF nodes were selected, and a worker pod is running on each of them.
```
$ kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
demo-container-launcher-wcvcs   1/1     Running   0          5s
demo-container-worker-0         1/1     Running   0          5s
demo-container-worker-1         1/1     Running   0          5s
```
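As an optional sanity check, you can confirm the environment variable that `$(DW_JOB_my_storage)` expands from inside the launcher pod; substitute your own launcher pod name:

```shell
# The variable is named after the storage in the profile and holds the
# mount path that is passed to mpi_hello_world.
kubectl exec demo-container-launcher-wcvcs -- env | grep DW_JOB_my_storage
```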
You can use kubectl to inspect the log to get your application's output:
```
$ kubectl logs demo-container-launcher-wcvcs
Defaulted container "nnf-container-example" out of: nnf-container-example, mpi-init-passwd (init), mpi-wait-for-worker-2 (init)
Warning: Permanently added '[demo-container-worker-1.demo-container.default.svc]:2222,[10.42.3.146]:2222' (ECDSA) to the list of known hosts.
Warning: Permanently added '[demo-container-worker-0.demo-container.default.svc]:2222,[10.42.2.118]:2222' (ECDSA) to the list of known hosts.
[1,0]<stdout>:Hello world from processor demo-container-worker-0, rank 0 out of 2 processors. NNF Storage path: /mnt/nnf/7f33f3bf-4559-4dda-892c-ff6148116a08-0, hostname: demo-container-worker-0
[1,0]<stdout>:Found indexed dir: /mnt/nnf/7f33f3bf-4559-4dda-892c-ff6148116a08-0/rabbit-node-1-0
[1,0]<stdout>:rank 0: test file: /mnt/nnf/7f33f3bf-4559-4dda-892c-ff6148116a08-0/rabbit-node-1-0/testfile
[1,1]<stdout>:Hello world from processor demo-container-worker-1, rank 1 out of 2 processors. NNF Storage path: /mnt/nnf/7f33f3bf-4559-4dda-892c-ff6148116a08-0, hostname: demo-container-worker-1
[1,1]<stdout>:Found indexed dir: /mnt/nnf/7f33f3bf-4559-4dda-892c-ff6148116a08-0/rabbit-node-2-0
[1,1]<stdout>:rank 1: test file: /mnt/nnf/7f33f3bf-4559-4dda-892c-ff6148116a08-0/rabbit-node-2-0/testfile
[1,0]<stdout>:rank 0: wrote file to '/mnt/nnf/7f33f3bf-4559-4dda-892c-ff6148116a08-0/rabbit-node-1-0/testfile'
[1,1]<stdout>:rank 1: wrote file to '/mnt/nnf/7f33f3bf-4559-4dda-892c-ff6148116a08-0/rabbit-node-2-0/testfile'
```
You can see that the storage path is passed into the container application and printed to the log:

```
Hello world from processor demo-container-worker-1, rank 1 out of 2 processors. NNF Storage path: /mnt/nnf/100db033-c9f2-4cf8-b085-505aebf571c1-0, hostname: demo-container-worker-1
```
The next state is `PostRun`. When the containers have exited cleanly, the status will become `Completed`.
```shell
kubectl patch --type merge workflow demo-container --patch '{"spec": {"desiredState": "PostRun"}}'
```

```
$ kubectl get workflows
NAME             STATE     READY   STATUS      AGE
demo-container   PostRun   true    Completed   13m
```

```
$ kubectl get pods
NAME                            READY   STATUS      RESTARTS   AGE
demo-container-launcher-wcvcs   0/1     Completed   0          73s
demo-container-worker-0         1/1     Running     0          73s
demo-container-worker-1         1/1     Running     0          73s
```
You can then tear down the workflow:
```shell
kubectl patch --type merge workflow demo-container --patch '{"spec": {"desiredState": "Teardown"}}'
```
Once completed, the workflow and profile can be deleted. Again, you may need admin privileges to remove the NNF Container Profile.
```shell
kubectl delete -f nnf-container-example.yaml
```
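To confirm the cleanup, list the resources once more; both lists should no longer include the demo resources (again assuming `nnfcontainerprofile` as the resource name):

```shell
kubectl get workflows
kubectl get nnfcontainerprofile -n nnf-system
```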
Compute node applications will have the ability to communicate with container applications using ports. The container application can listen on a port that is assigned to the NNF container, and the port number is made available to the compute node application via environment variables. Port assignment for containers running on the NNF nodes is not yet implemented.