False alarm category sheet: https://docs.google.com/spreadsheets/d/1_tmdu3MBnwHizVvgNGLjUsJYM6MGa1ppNa0YRHS64w0/edit?usp=sharing
- Golang
- Python dependencies
    pip3 install -r requirements.txt
- k8s Kind cluster
    go install sigs.k8s.io/kind@<version>
- kubectl
- helm
To run the test:
python3 acto.py \
    --config CONFIG, -c CONFIG
            Operator port config path
    --duration DURATION, -d DURATION
            Number of hours to run
    --preload-images [PRELOAD_IMAGES [PRELOAD_IMAGES ...]]
            Docker images to preload into Kind cluster
    --helper-crd HELPER_CRD
            Generated CRD file that helps with the input generation
    --context CONTEXT
            Cached context data
    --num-workers NUM_WORKERS
            Number of concurrent workers to run Acto with
    --dryrun
            Only generate test cases without executing them
Example operator port config (rabbitmq-operator):
{
  "deploy": {
    "method": "YAML",
    "file": "data/rabbitmq-operator/operator.yaml",
    "init": null
  },
  "crd_name": null,
  "custom_fields": "data.rabbitmq-operator.prune",
  "seed_custom_resource": "data/rabbitmq-operator/cr.yaml",
  "analysis": {
    "github_link": "https://github.com/rabbitmq/cluster-operator.git",
    "commit": "f2ab5cecca7fa4bbba62ba084bfa4ae1b25d15ff",
    "entrypoint": null,
    "type": "RabbitmqCluster",
    "package": "github.com/rabbitmq/cluster-operator/api/v1beta1"
  }
}
JSON schema for the config:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "deploy": {
      "type": "object",
      "properties": {
        "method": {
          "description": "One of three deploy methods [YAML HELM KUSTOMIZE]",
          "type": "string"
        },
        "file": {
          "description": "the deployment file",
          "type": "string"
        },
        "init": {
          "description": "any yaml to deploy for deploying the operator itself",
          "type": "string"
        }
      },
      "required": [
        "method",
        "file",
        "init"
      ]
    },
    "crd_name": {
      "description": "name of the CRD to test, optional if there is only one CRD",
      "type": "string"
    },
    "custom_fields": {
      "description": "file to guide the pruning",
      "type": "string"
    },
    "seed_custom_resource": {
      "description": "the seed CR file",
      "type": "string"
    },
    "analysis": {
      "type": "object",
      "properties": {
        "github_link": {
          "description": "github link for the operator repo",
          "type": "string"
        },
        "commit": {
          "description": "specific commit hash of the repo",
          "type": "string"
        },
        "entrypoint": {
          "description": "directory of the main file",
          "type": "string"
        },
        "type": {
          "description": "the root type of the CR",
          "type": "string"
        },
        "package": {
          "description": "package of the root type",
          "type": "string"
        }
      },
      "required": [
        "github_link",
        "commit",
        "entrypoint",
        "type",
        "package"
      ]
    }
  },
  "required": [
    "deploy",
    "crd_name",
    "custom_fields",
    "seed_custom_resource",
    "analysis"
  ]
}
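A quick way to sanity-check a port config before a run is to confirm that every key listed under "required" in the schema is present. The sketch below uses only the Python standard library and checks key presence only, not value types (the rabbitmq-operator example sets some required keys to null). The helper name missing_keys and the sample input are hypothetical and not part of Acto.

```python
import json

# Required keys copied from the schema's "required" lists.
REQUIRED_TOP = ["deploy", "crd_name", "custom_fields", "seed_custom_resource", "analysis"]
REQUIRED_NESTED = {
    "deploy": ["method", "file", "init"],
    "analysis": ["github_link", "commit", "entrypoint", "type", "package"],
}


def missing_keys(config):
    """Return dotted paths of required keys that are absent from config."""
    missing = [key for key in REQUIRED_TOP if key not in config]
    for section, keys in REQUIRED_NESTED.items():
        value = config.get(section)
        if not isinstance(value, dict):
            # Absent sections are reported above; non-object sections are skipped.
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in value)
    return missing


# Hypothetical, deliberately incomplete config used only for illustration.
example = json.loads(
    '{"deploy": {"method": "YAML", "file": "data/rabbitmq-operator/operator.yaml"},'
    ' "crd_name": null}'
)
print(missing_keys(example))
```

A key whose value is null still counts as present here, which matches how the example config fills "init", "crd_name", and "entrypoint".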
- (A Known Issue of Kind) Failed cluster creation when using the multiple worker functionality by specifying --num-workers. This may be caused by running out of inotify resources. Resource limits are defined by the fs.inotify.max_user_watches and fs.inotify.max_user_instances system variables. For example, in Ubuntu these default to 8192 and 128 respectively, which is not enough to create a cluster with many nodes.
To increase these limits temporarily, run the following commands on the host:
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
To make the changes persistent, edit the file /etc/sysctl.conf and add these lines:
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
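Before starting a multi-worker run, it can help to check whether the host's current limits are below the values suggested above. The following is a minimal sketch, assuming a Linux host where the sysctl values are exposed under /proc/sys/fs/inotify; the function name inotify_shortfalls is hypothetical and not part of Acto or Kind.

```python
from pathlib import Path

# Values suggested in the text above.
SUGGESTED = {"max_user_watches": 524288, "max_user_instances": 512}


def inotify_shortfalls(base="/proc/sys/fs/inotify"):
    """Return {name: (current, suggested)} for limits below the suggested value."""
    shortfalls = {}
    for name, suggested in SUGGESTED.items():
        path = Path(base) / name
        try:
            current = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # not Linux, or the file is unreadable; skip this limit
        if current < suggested:
            shortfalls[name] = (current, suggested)
    return shortfalls


for name, (current, suggested) in inotify_shortfalls().items():
    print(f"fs.inotify.{name} is {current}; consider raising it to {suggested}")
```

If nothing is printed, both limits already meet the suggested values (or the files could not be read).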
rabbitmq-operator:
python3 acto.py --config data/rabbitmq-operator/config.json --num-workers 4
The reproduction utility lets Acto reproduce previously found bugs. It takes a folder containing previously generated CRs (i.e. mutated files) as input and deploys each CR directly. To reproduce a bug in a given operator, Acto can therefore run a single test case instead of running all test cases for that operator.
Usage:
python3 reproduce.py --reproduce-dir <path to the folder containing CRs> --config <path to corresponding config.json>
Acto aims to automate E2E testing as much as possible to minimize users' labor.
Currently, porting an operator still requires some manual effort. We need:
- A way to deploy the operator. The deployment method needs to handle all the prerequisites for deploying the operator, e.g. CRD, namespace creation, RBAC, etc. Currently we support three deploy methods: yaml, helm, and kustomize. For example, rabbitmq-operator uses yaml for deployment (data/rabbitmq-operator/operator.yaml).
- A seed CR yaml serving as the initial CR input. This can be any valid CR for your application. Example: data/rabbitmq-operator/cr.yaml
Acto supports multi-cluster parallelism, which in theory lets Acto scale perfectly. In practice, it did not scale that well when I tried to run Acto on a CloudLab c220g5 machine.
The machine has 40 cores, 192 GB RAM, and a 500 GB SSD. I tried to run Acto with 8 clusters, and all of them failed because the operators took more than 5 minutes to become ready. Every cluster became extremely slow, and liveness probes failed so often that pods were recreated all the time. I found that Acto is IO-bound: its performance is bottlenecked by disk read/write speed.
We usually have 2 GB of images to preload into each cluster node, and each cluster has 4 nodes. With 8 clusters, the machine needs to preload ~20GB * 8 of content to disk, and then read some of it into memory to start running 4 * 8 Kubernetes nodes. During all this time, CPU and memory are mostly idle while the SSD is at full load.
To mitigate this issue, there are several directions:
- Reduce the size of the preload images
  - There are some redundant images in the image archive; we can remove them
- Reduce the number of nodes in the cluster
- Switch to k3s, which is a lightweight version of k8s
- Mount docker's workdir in tmpfs on the RAM
  - Since we have 192 GB RAM on the machine, we can put docker's workdir in RAM to avoid the IO bottleneck
Golang does not have many tools for measuring code coverage other than the native go test utility.
However, go test does not support measuring code coverage for E2E tests; it only supports
measuring code coverage at the unit/integration level.
We use a series of hacks to measure the code coverage for Acto:
- Create a new file called main_test.go in the same directory as main.go. main_test.go should contain one unit test which calls the main function. In this way, we create a virtual unit test which just runs the main function, essentially an E2E test.
- Next, we need to compile this unit test into a binary and build a docker image from it. Luckily, go test supports the -c flag, which compiles the unit test into a binary to be run later instead of running it immediately. We then modify the Dockerfile to change the build command from go build ... to go test -c ... with appropriate flags. Along with the build flags, we also pass in test flags such as -coverpkg and -cover.
- Having the test binary is not enough; we need to pass a flag when running the binary to redirect the coverage information to a file. To do this, we create a shell script which execs the binary with the -test.coverprofile=/tmp/profile/cass-operator-`date +%s%N`.out flag, and make this shell script the entrypoint of the docker image.
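The last step uses a shell script as the image entrypoint. Purely to illustrate what that wrapper does, the sketch below builds the same command line in Python instead of shell. The binary path /manager.test and the function names are assumptions for illustration and are not taken from Acto or cass-operator; the setup described above uses a shell script, not this code.

```python
import os
import time


def coverage_command(binary, profile_dir="/tmp/profile", prefix="cass-operator", now_ns=None):
    """Build argv for the test binary with a unique -test.coverprofile path."""
    if now_ns is None:
        # Nanoseconds since the epoch, the same value `date +%s%N` prints.
        now_ns = time.time_ns()
    profile = f"{profile_dir}/{prefix}-{now_ns}.out"
    return [binary, f"-test.coverprofile={profile}"]


def main():
    # Hypothetical path of the binary produced by `go test -c`.
    argv = coverage_command("/manager.test")
    os.makedirs("/tmp/profile", exist_ok=True)
    # Replace this process with the test binary, as `exec` does in a shell script.
    os.execv(argv[0], argv)
```

The timestamp in the file name keeps each container restart from overwriting the previous profile. A profile written this way can afterwards be summarized with go tool cover -func=<profile>.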