Acto: Automatic, Continuous Testing for (Kubernetes/OpenShift) Operators

Prerequisites

Usage

To run the test:

python3 acto.py \
  --config CONFIG, -c CONFIG
                        Operator port config path
  --duration DURATION, -d DURATION
                        Number of hours to run
  --preload-images [PRELOAD_IMAGES [PRELOAD_IMAGES ...]]
                        Docker images to preload into Kind cluster
  --helper-crd HELPER_CRD
                        generated CRD file that helps with the input generation
  --context CONTEXT     Cached context data
  --num-workers NUM_WORKERS
                        Number of concurrent workers to run Acto with
  --dryrun              Only generate test cases without executing them
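
For example, to only generate test cases for the rabbitmq-operator config (see the Example section below) without executing them, an invocation could look like this illustrative sketch:

python3 acto.py --config data/rabbitmq-operator/config.json --dryrun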

Operator config example

{
    "deploy": {
        "method": "YAML",
        "file": "data/rabbitmq-operator/operator.yaml",
        "init": null
    },
    "crd_name": null,
    "custom_fields": "data.rabbitmq-operator.prune",
    "seed_custom_resource": "data/rabbitmq-operator/cr.yaml",
    "analysis": {
        "github_link": "https://github.com/rabbitmq/cluster-operator.git",
        "commit": "f2ab5cecca7fa4bbba62ba084bfa4ae1b25d15ff",
        "entrypoint": null,
        "type": "RabbitmqCluster",
        "package": "github.com/rabbitmq/cluster-operator/api/v1beta1"
    }
}

JSON schema for writing the operator porting config

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "deploy": {
      "type": "object",
      "properties": {
        "method": {
          "description": "One of three deploy methods [YAML HELM KUSTOMIZE]",
          "type": "string"
        },
        "file": {
          "description": "the deployment file",
          "type": "string"
        },
        "init": {
          "description": "any yaml to deploy for deploying the operator itself",
          "type": "string"
        }
      },
      "required": [
        "method",
        "file",
        "init"
      ]
    },
    "crd_name": {
      "description": "name of the CRD to test, optional if there is only one CRD",
      "type": "string"
    },
    "custom_fields": {
      "description": "file to guide the pruning",
      "type": "string"
    },
    "seed_custom_resource": {
      "description": "the seed CR file",
      "type": "string"
    },
    "analysis": {
      "type": "object",
      "properties": {
        "github_link": {
          "description": "github link for the operator repo",
          "type": "string"
        },
        "commit": {
          "description": "specific commit hash of the repo",
          "type": "string"
        },
        "entrypoint": {
          "description": "directory of the main file",
          "type": "string"
        },
        "type": {
          "description": "the root type of the CR",
          "type": "string"
        },
        "package": {
          "description": "package of the root type",
          "type": "string"
        }
      },
      "required": [
        "github_link",
        "commit",
        "entrypoint",
        "type",
        "package"
      ]
    }
  },
  "required": [
    "deploy",
    "crd_name",
    "custom_fields",
    "seed_custom_resource",
    "analysis"
  ]
}
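
To sanity-check a porting config before a run, one option is to validate it against the schema above using the third-party jsonschema package. The snippet below is an illustrative sketch; the schema file name porting-config.schema.json is an assumption, not a file Acto ships:

import json

from jsonschema import validate  # third-party: pip install jsonschema

# Load the JSON schema shown above (file name is an assumption)
with open("porting-config.schema.json") as f:
    schema = json.load(f)

# Load the operator porting config to check
with open("data/rabbitmq-operator/config.json") as f:
    config = json.load(f)

# Raises jsonschema.exceptions.ValidationError if the config does not match the schema
validate(instance=config, schema=schema)
print("porting config matches the schema")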

Known Issues

  • (A known issue of Kind) Cluster creation may fail when running multiple workers via --num-workers.

    This may be caused by running out of inotify resources. Resource limits are defined by fs.inotify.max_user_watches and fs.inotify.max_user_instances system variables. For example, in Ubuntu these default to 8192 and 128 respectively, which is not enough to create a cluster with many nodes.

    To increase these limits temporarily, run the following commands on the host:

    sudo sysctl fs.inotify.max_user_watches=524288
    sudo sysctl fs.inotify.max_user_instances=512

    To make the changes persistent, edit the file /etc/sysctl.conf and add these lines:

    fs.inotify.max_user_watches = 524288
    fs.inotify.max_user_instances = 512

Example:

rabbitmq-operator:

python3 acto.py --config data/rabbitmq-operator/config.json \
                --num-workers 4

Porting operators

Acto aims to automate E2E testing as much as possible to minimize users' labor.

Currently, porting an operator still requires some manual effort. We need:

  1. A way to deploy the operator. The deployment method needs to handle all the necessary prerequisites for deploying the operator, e.g. CRDs, namespace creation, RBAC, etc. Currently we support three deploy methods: yaml, helm, and kustomize. For example, rabbitmq-operator uses yaml for deployment, and the example is shown here
  2. A seed CR YAML serving as the initial CR input. This can be any valid CR for your application (a minimal sketch is shown after this list). Example
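
For the rabbitmq-operator, a minimal seed CR could look like the sketch below (illustrative: only the kind matches the operator config above; the group/version rabbitmq.com/v1beta1 is the one used by the RabbitMQ cluster operator, and the name and replica count are arbitrary):

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: test-cluster
spec:
  replicas: 1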

Known Limitations

Next Steps

Running Acto on good machines

Acto supports multi-cluster parallelism, which in theory makes Acto scale perfectly. However, it turned out not to be that perfect when I tried to run Acto on a CloudLab 220-g5 machine.

The machine has 40 cores, 192 GB of RAM, and a 500 GB SSD. I tried to run Acto with 8 clusters, and all of them failed because the operators took more than 5 minutes to get ready. Every cluster became extremely slow; liveness probes failed very often, causing pods to be recreated all the time. I found out that Acto is IO-bound, meaning the performance is bottlenecked by disk read/write speed.

We usually have 2 GB of images to preload into each cluster node, and each cluster has 4 nodes. With 8 clusters, the machine needs to preload ~20GB * 8 of content onto the disk, and then read some of it back into memory to start running 4 * 8 Kubernetes nodes. During all this time, the CPU and memory are mostly idle while the SSD is at full load.

To mitigate this issue, there are two directions:

  1. Reduce the size of the preload images
  • There are some redundant images in the image archive that we can remove
  • Reduce the number of nodes in the cluster
  • Switch to k3s, which is a lightweight version of k8s
  2. Mount Docker's workdir on tmpfs (in RAM)
  • Since we have 192 GB of RAM on the machine, we can put Docker's workdir in RAM to avoid the IO bottleneck
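
A rough sketch of direction 2, assuming Docker's default data directory /var/lib/docker and a systemd-managed Docker daemon (the tmpfs size is arbitrary; images already on disk would need to be re-pulled or re-loaded afterwards):

# Stop Docker before remounting its data directory
sudo systemctl stop docker
# Back the data directory with RAM (contents are lost on unmount or reboot)
sudo mount -t tmpfs -o size=100G tmpfs /var/lib/docker
sudo systemctl start docker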