This repo contains the AWS ParallelCluster Chef cookbook used in AWS ParallelCluster.
The root folder of the cookbook repository can be considered the main cookbook, since it contains two files: Berksfile
and metadata.rb
.
These files are used by the CLI (build image time) and user-data.sh
code to vendor
the other sub-cookbooks and third party cookbooks.
The main cookbook does not contain any recipe, attribute or library. They are distributed in the functional cookbooks under the cookbooks
folder
defined as follows:
aws-parallelcluster-entrypoints
is the cookbook to define the external interface, contains recipes called by AMI builder, cluster setup, cluster update, etc. The recipes in this cookbook are called directly by the CLI or CI/CD. It orchestrates invocation of recipes/resources from other cookbooks;aws-parallelcluster-platform
OS packages and system configuration (directories, users, services, drivers);aws-parallelcluster-environment
AWS services configuration and usage, such as shared file systems, directory service and network interfaces;aws-parallelcluster-computefleet
slurm specific scaling logic, compute fleet scripts and daemons;aws-parallelcluster-awsbatch
files required to support AWS Batch as scheduler;aws-parallelcluster-slurm
files required to support Slurm as a scheduler and its dependencies (Munge, MySQL for accounting, etc), it depends byaws-parallelcluster-computefleet
;
Finally, some common code, such as source/script directories, usernames, package installer etc., are located in a cookbook
which every other cookbook depend on, that is aws-parallelcluster-shared
.
Each cookbook hosts recipes and resources, attributes, functions, files and templates belonging to its functional area.
Every cookbook contains ChefSpec and Kitchen tests for its code.
However, the code in a cookbook might require that some other code from a different cookbook not listed among dependencies,
to be executed as a prerequisite (test setup phase). For this reason aws-parallelcluster-tests
cookbook must depend on every other cookbook.
The test
folder contains Python unit tests files and Kitchen environment files.
The kitchen
folder contains utility files used when running Inspec tests locally.
The cookbooks/third-party
folder contains cookbooks from marketplace. They must be regularly updated and should not be modified by hand.
They have been pre-downloaded and stored in our repository to avoid contacting Chef Marketplace at AMI build time and cluster creation.
You can find more information about them in the cookbooks/third-party/THIRD-PARTY-LICENSES.txt
file.
ChefSpec is a unit testing framework for testing Chef cookbooks. It is very fast, and we use it to verify recipes with multiple branches (e.g. HeadNode vs ComputeNode) work as expected. They don't need virtual machines or cloud servers. They can be executed locally by executing:
cd cookbooks/aws-parallelcluster-platform
# run all the ChefSpec tests in a cookbook
chef exec rspec
# run a specific ChefSpec test
chef exec rspec ./spec/unit/recipes/sudo_config_spec.rb
They are automatically executed as GitHub actions, see definition in .github/ci.yml
.
Kitchen is used to automatically test cookbooks across any combination of platforms and test suites. It requires cinc-workstation to be installed on your environment:
curl -L https://omnitruck.cinc.sh/install.sh | sudo bash -s -- -P cinc-workstation -v 23
Make sure you have set a locale in your local shell environment, by exporting the LC_ALL
and LANG
variables,
for example by adding to your .bashrc
or .zshrc
the following and sourcing the file:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
To speed up the transfer of files when kitchens are run on ec2 instances,
the transport selected is kitchen-transport-speedy.
To install kitchen-transport-speedy
in the kitchen embedded ruby environment please run: /opt/cinc-workstation/embedded/bin/gem install kitchen-transport-speedy
.
In order to test on docker containers, you also need docker
installed on your environment.
Please note that not all the tests can run on docker so in any case we need to validate our recipes on EC2.
You can use on_docker?
condition to skip execution of some recipes steps, resource actions or controls execution on docker.
Please look at "Known issues with docker" section of the README for specific issues (e.g. when running kitchen tests on non amd64
architectures).
kitchen.docker.sh
and kitchen.ec2.sh
help you run kitchen tests virtually without any further environment setup.
They are wrappers of the kitchen
command, so you can pass them all the options exposed by it. See kitchen --help
for more details.
You must do some initial setup on your AWS account in order to be able to use defaults from kitchen.ec2.sh
.
Default values are the following. Take a look at comments at the top of the script in order to understand how to use it.
: "${KITCHEN_AWS_REGION:=${AWS_DEFAULT_REGION:-eu-west-1}}"
: "${KITCHEN_KEY_NAME:=kitchen}"
: "${KITCHEN_SSH_KEY_PATH:="~/.ssh/${KITCHEN_KEY_NAME}-${KITCHEN_AWS_REGION}.pem"}"
: "${KITCHEN_AVAILABILITY_ZONE:=a}"
: "${KITCHEN_ARCHITECTURE:=x86_64}"
Both scripts can be run as follows:
kitchen.<ec2|docker>.sh <context> <kitchen parameters>
<context>
is your test context, like environment-config
or platform-install
.
For example ./kitchen.docker.sh platform-install test nvidia
will execute kitchen test
command, executing all the
tests for which the name starts with nvidia
prefix, in cookbooks/aws-parallelcluster-platform
directory on docker.
It is important to keep in mind that the parameter after the kitchen action is a pattern,
so it's important to choose the appropriate naming for kitchen tests suites.
For instance, we can use nvidia-<context>
for nvidia-related tests, so that they can be run separately or together.
However, we should not have nvidia
and nvidia-something
tests, as we wouldn't be able to run only the first one on all OSes.
Examples of submission are:
# Run supervisord kitchen test from file kitchen.platform-install.yml in cookbooks/aws-parallelcluster-platform directory,
# for all OSes (concurrency 5) and log level debug
# Note that in this case "supervisord" is a pattern, so all the tests starting with "supervisord" string in that yaml file will be executed.
./kitchen.docker.sh platform-install test supervisord -c 5 -l debug
# Run converge phase only of kitchen from file kitchen.environment-config.yml in cookbooks/aws-parallelcluster-environment directory, for alinux2 only.
# This is useful when you want to test recipe execution only.
# Once you have executed the converge step, you can for example execute multiple times the verify step, to validate the tests you are writing.
./kitchen.ec2.sh environment-config converge efa-alinux2
# Run verify phase only from file kitchen.platform-config.yml in cookbooks/aws-parallelcluster-platform directory,
# useful if you're modifing the test logic without touching the recipes code.
./kitchen.ec2.sh platform-config verify sudo -c 5
# Login to the instance created with the converge step
./kitchen.ec2.sh platform-config login sudo-alinux2
A context must have the format $subject-$phase
.
Supported phases are:
install
(on EC2 it defaults to a bare base AMI)config
(on EC2 it defaults to a ParallelCluster official AMI)
It will use kitchen.${context}.yml
in the specific cookbook, i.e. in cookbooks/aws-parallelcluster-$subject
dir.
You can override default values by setting environment variables in a .kitchen.env.sh
file to be created in the cookbook root folder.
Example of .kitchen.env.sh
file:
export KITCHEN_KEY_NAME=your-key # ED25519 key type (required for Ubuntu 22)
export KITCHEN_SSH_KEY_PATH=/path/your-key.pem
export KITCHEN_AWS_REGION=eu-west-1
export KITCHEN_SUBNET_ID=subnet-xxx
export KITCHEN_SECURITY_GROUP_ID=sg-your-group
export KITCHEN_INSTANCE_TYPE=t2.large
export KITCHEN_IAM_PROFILE=test-kitchen # required for tests with lifecycle hooks
The different kitchen.${context}.yml
files in the functional cookbooks contain a list of Inspec tests
for the different recipes and resources.
Every test specifies:
- the
run_list
that is the list of recipes to be executed as preparatory steps as part of thekitchen converge
phase:- the
recipe[aws-parallelcluster-tests::setup]
is a utility recipe that should be added to every test to prepare the environment and automatically execute resources and recipes listed as dependencies in thedependencies
attributes. - the
recipe[aws-parallelcluster-tests::test_resource]
is a utility recipe to simplify testing of the custom resource defined in theresource
attribute. Please checktest_resource
content to see which parameters you can pass to it.
- the
- the
verifier
with the list of controls to execute as part of thekitchen verify
phase, it's possible to use regex here, it can accept regular expressions in format/regex/
. - the node
attributes
that will be propagated in the test environment.resource
is a reserved attribute, used bytest_resource
recipe mentioned before.dependencies
is a reserved attribute, used bysetup
recipe mentioned before.cluster
structure permits to pass specific parameters to the test to simulate environment condition (i.e. dna.json configuration that should come from the CLI when executing the recipes in a real cluster)
Example of test definition:
- name: system_authentication
run_list:
- recipe[aws-parallelcluster-tests::setup]
- recipe[aws-parallelcluster-tests::test_resource]
verifier:
controls:
- /tag:config_system_authentication/
attributes:
resource: system_authentication:configure
dependencies:
- resource:system_authentication:setup
cluster:
directory_service:
enabled: "true"
node_type: HeadNode
When you execute a test like this with kitchen test
command it will execute the recipes or resources actions specified in the run_list
,
including dependencies
, will set cluster
attributes in the environment and at the end will execute the verify
steps by executing
the listed controls
.
The kitchen test
command will execute all the steps and will destroy the instance at the end. If you want to preserve the instance
you can execute the step one by one, check kitchen help
for more details.
As you can see in the .github/workflows/dokken-system-tests.yml
we are executing both install and config recipes as GitHub actions.
We execute install steps in the Kitchen Test Install
(to simulate AMI build) and then re-using the container to validate the config steps,
in the Kitchen Test Config
.
In our daily CI/CD we build an AMI (calling the aws-parallelcluster-entrypoints::install
recipe) and then execute kitchen tests on top of it.
Both CI/CD and GitHub actions use the kitchen.validate-config.yml
file in the root folder to validate the config steps.
If you look at it, you can see it runs all the inspec_tests
from all the cookbooks by executing the
controls matching the /tag:config/
regex.
verifier:
inspec_tests:
- /tmp/cookbooks/aws-parallelcluster-awsbatch/test
- /tmp/cookbooks/aws-parallelcluster-platform/test
- /tmp/cookbooks/aws-parallelcluster-environment/test
- /tmp/cookbooks/aws-parallelcluster-computefleet/test
...
controls:
- /tag:config/
This means that if you want a specific control to be executed as part of the CI/CD or GitHub action you should
use tag:config_
as prefix in the control name.
Note that not all the Inspec tests can run on GitHub and on the CI/CD because, as you can see in the kitchen.validate-config.yml
,
in this case the run_list
is defined as follows:
_run_list: &_run_list
- recipe[aws-parallelcluster-tests::setup]
- recipe[aws-parallelcluster-entrypoints::init]
- recipe[aws-parallelcluster-entrypoints::config]
- recipe[aws-parallelcluster-entrypoints::finalize]
- recipe[aws-parallelcluster-tests::tear_down]
without any attribute specified, so this could not match with the run_list
or the attributes
specified in the test definition.
So before adding the tag:config_
prefix to a control name, please be sure that it can run even without specific setup.
If you want to execute some kitchen tests at the end of the build-image process, to validate that a created AMI
contains what we expect, you can use the tag:install_
as prefix in the control name. They will be automatically executed by
Image builder at the end of the build-image process (search for tag:install_
in the CLI code to understand the details behind this mechanism).
Please note that if this test fails the build of the image will fail as well.
If you want to execute some kitchen tests as part of validate phase the build-image process without causing the build image to fail
you can use the tag:testami_
as prefix. These tests will be executed when the image has already been created
(search for tag:testami_
in the CLI code to understand the details behind this mechanism).
Please note that a test suite name can contain multiple tags (for instance, tag:install_tag:config_
), the code search for them
with a regex so it's not required to have them as prefix.
When you set the environment variable KITCHEN_SAVE_IMAGE=true
, a successful kitchen verify
phase will lead to
the Docker image being committed with the tag pcluster-${PHASE}/${INSTANCE_NAME}
.
For instance, if you successfully run
./kitchen.docker.sh platform-install test directories-alinux2
an image with tag pcluster-install/directories-alinux2:latest
will be saved.
To use it in a later Kitchen test, export KITCHEN_${PLATFORM}_IMAGE=<your_image>
.
For instance, to reuse the image from the example above, set KITCHEN_ALINUX2_IMAGE=pcluster-install/directories-alinux2
.
We are using this approach to re-use the docker image created by the Kitchen Test Install
in the following Kitchen Test Config
phase
as part of the GitHub actions.
The procedure described above also applies to EC2, with minor differences.
- To keep the EC2 instance running while the image is being cooked, refrain from using
kitchen test
orkitchen destroy
commands. Opt forkitchen verify
and destroy the instance once the AMI is ready. - Set
KITCHEN_${PLATFORM}_AMI=<ami_id>
to reuse the AMI. For instance,KITCHEN_ALINUX2_AMI=ami-nnnnnnnnnnnnn
.
This is useful when you need a long list of dependencies to be installed in the AMI (e.g. Slurm recipes) to verify configuration steps.
Kitchen lifecycle hooks allow running commands before and/or after any phase of Kitchen tests (create, converge, verify, or destroy).
We leverage this feature in Kitchen tests to create/destroy AWS resources (see kitchen.global.yaml
file.
For each phase, a generic run script executes custom
${THIS_DIR}/${KITCHEN_COOKBOOK_PATH}/test/hooks/${KITCHEN_PHASE}/${KITCHEN_SUITE_NAME}/${KITCHEN_HOOK}.sh
script, if it exists.
Example.
network_interfaces
Kitchen test suite in the aws-parallelcluster-environment
cookbook requires a network interface to be attached to the node.
cookbooks/aws-parallelcluster-environment/test/hooks/config/network_interfaces/post_create.sh
: creates ENI and attaches it to the instancecookbooks/aws-parallelcluster-environment/test/hooks/config/network_interfaces/pre_destroy.sh
: detaches and deletes ENI.
In the kitchen.global.yaml
we're configuring an environment.
In the environment file (i.e. test/environments/kitchen.rb
), for every value to pass and for every OS,
you have to define a line like: '<suite_name>-<variable_name>/<platform>' => 'placeholder'
. For instance:
default_attributes 'kitchen_hooks' => {
'ebs_mount-vol_array/alinux2' => 'placeholder',
...
}
These environment variables will be available to the kitchen tests as node attributes:
node['kitchen_hooks']['ebs_mount-vol_array/alinux2']
.
To permit to use these environment variables as parameters attributes you have to use th FROM-HOOK
keyword in the test suite definition.
e.g. resource: 'manage_ebs:mount {"shared_dir_array" : ["shared_dir"], "vol_array" : "FROM_HOOK-<suite_name>-<variable_name>"}'
This value will be automatically replaced, searching for the <suite_name>-<variable_name>/<platform>
in the environment.
You can find all the details of this mechanism in the test_resource.rb
.
Note: the value of the property to be replaced must be a string even if it's an array. It's up to the post_create script to define an array in the environment.
Running locally kitchen tests on system with CPU architecture other than amd64
(i.e. Apple Silicon that have arm64
)
may run in a known dokken issue (tracked as test-kitchen/kitchen-dokken#288).
All tests will fail with messages containing errors such as:
[qemu-x86_64: Could not open '/lib64/ld-linux-x86-64.so.2](https://stackoverflow.com/questions/71040681/qemu-x86-64-could-not-open-lib64-ld-linux-x86-64-so-2-no-such-file-or-direc)
To work around the issue, please ensure that the cinc-workstation
version is >= 23
, as it's the first one that has a
dokken version that features platform support.
Providing the correct platform configuration in ./kitchen.docker.yml
:
---
driver:
name: dokken
platform: linux/amd64
pull_platform_image: false # Use the local images, prevent pull of docker images from Docker Hub,
chef_version: 17 # Chef version aligned with the one used to build the images
chef_image: cincproject/cinc
...
is required but not enough if images for different CPU architectures already are present in the local docker cache. Local images of different architectures should be removed in order to work around the issue, then in subsequent executions dokken will pull the ones for the specified platform and use those, since there are no other than those for the correct architecture available locally.
Here are some examples to clean up local docker containers and images:
# removes running containers that may have been left dangling by previous
# executions of <your test prefix> test
docker rm \
$(docker container stop \
$(docker container ls -q --filter name='<your test prefix>*'))
# remove images from offending <your test prefix>
# you may want also to remove all dokken images
# (and safely remove all images, since subsequent executions will pull the
# required ones)
docker rmi \
$(docker images --format '{{.Repository}}:{{.Tag}}' \
| grep '<your test prefix>')
dokken expects that ~/.docker/config.json
contains an "auths"
key, fails in docker_config_creds
with NPE
otherwise, this issue is tracked in upstream as: test-kitchen/kitchen-dokken#290
On Ubuntu22, kitchen create
keeps trying to connect to the instance via ssh indefinitely.
If you interrupt it and try to run kitchen verify
, you see authentication failures.
This happens because Ubuntu22 does not accept authentication via RSA key. You need to re-create a key pair
using ED25519
key type.
If Kitchen doesn't detect your changes, try
berks shelf uninstall ${COOKBOOK_NAME}
Python tests are configured in tox.ini
file, including paths to python files.
If you move python files around, you need to fix python path accordingly.