Implement CI for RHOBS #418

Closed · wants to merge 3 commits

Conversation

@vprashar2929 (Contributor):

This PR implements CI for RHOBS.
The workflow test.yaml performs the following tasks:

  1. Install Docker on the Ubuntu GitHub runner.
  2. Download the oc CLI binary.
  3. Deploy the MicroShift Docker container.
  4. Deploy some of the dependencies.
  5. Run the test script in tests/ci_test.sh.

The ci_test.sh script primarily deploys each RHOBS resource one at a time on the MicroShift cluster and then runs a pod health check to verify that the pod is in a running state after the required manifests are applied. This is related to RHOBS-552. Currently it covers only basic validation; in-depth validation can be covered under the post-deploy job.

This is still in progress, so any comments, suggestions, or contributions are welcome 😄
If you want to test with a breakable commit, change the following syntax inside .github/workflows/test.yaml and push your breakable commit/changes to verify:

# on:
#  workflow_run:
#    workflows: ["CI build"]
#    types:
#    - completed

# For debugging the pipeline
on: [push]

Note: Currently the test pipeline takes approximately 30 minutes to complete. I will try to reduce this if possible.

@periklis (Contributor) left a comment:

Awesome ❤️ for doing this long-standing work! I really appreciate all the effort put into this. I have a couple of improvement comments to make it sustainable and easy to extend for our other teams (e.g. my folks and myself from the log storage team and the tracing folks).

  • Let's try to squeeze as much as possible out of existing GitHub actions for setting up microshift, docker, and the oc binary. Using off-the-shelf actions improves caching and, in turn, build times.
  • Much of ci_test.sh can be simplified with existing commands such as oc rollout or the older oc wait. Consider reusing as much as possible from: observatorium e2e.sh
  • Since we start a huge system in a cluster, it would be super helpful to have a must-gather in case of erroneous build runs, e.g. have a look at the observatorium must-gather func (a rough sketch of such a helper follows below this list).
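
A minimal sketch of such a must-gather helper, assuming plain oc commands and an artifacts directory provided by the CI job; the function name, namespaces, and label selector below are illustrative only:

# Collect basic diagnostics before tearing the cluster down on a failed run.
# Namespaces and labels listed here are placeholders for the ones the CI actually creates.
must_gather() {
    artifacts_dir=${1:-/tmp/artifacts}
    mkdir -p "$artifacts_dir"
    oc get pods --all-namespaces -o wide > "$artifacts_dir/pods.txt"
    oc get events --all-namespaces --sort-by=.lastTimestamp > "$artifacts_dir/events.txt"
    for ns in observatorium observatorium-metrics minio dex; do
        oc describe pods -n "$ns" > "$artifacts_dir/describe-$ns.txt" 2>/dev/null || true
        oc logs -n "$ns" --all-containers --prefix -l app.kubernetes.io/part-of=observatorium \
            > "$artifacts_dir/logs-$ns.txt" 2>/dev/null || true
    done
}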

Comment on lines +25 to +66
      - name: Print CWD
        run: echo "$PWD"
      - name: Configure env
        run: |
          sudo apt-get update
          sudo apt-get install \
            ca-certificates \
            curl \
            wget \
            gnupg \
            lsb-release
          sudo mkdir -m 0755 -p /etc/apt/keyrings
          curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
          echo \
            "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
            $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
          sudo chmod a+r /etc/apt/keyrings/docker.gpg
          sudo apt-get update
      - name: Install Docker
        run: |
          sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
      - name: Check docker is running
        run: |
          sudo systemctl status docker
      - name: Get Docker info
        run: |
          docker info
      - name: Get OC CLI Binary
        run: |
          wget https://github.com/okd-project/okd/releases/download/4.12.0-0.okd-2023-02-18-033438/openshift-client-linux-4.12.0-0.okd-2023-02-18-033438.tar.gz
          tar xzvf openshift-client-linux-4.12.0-0.okd-2023-02-18-033438.tar.gz
          sudo mv oc kubectl /usr/local/bin
      - name: Run OC Binary
        run: oc --help
      - name: Spin up microshift container
        run: |
          docker run -d --name microshift --privileged -v microshift-data:/var/lib -p 6443:6443 -p 80:80 -p 443:443 quay.io/microshift/microshift-aio:latest
          sleep 60s
      - name: Export the kubeconfig
        run: docker exec -i microshift cat /var/lib/microshift/resources/kubeadmin/kubeconfig > tests/kubeconfig
      - name: List the kubeconfig file
        run: ls tests
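
As an aside on the fixed sleep above, a sketch of a more deterministic readiness check, assuming the kubeconfig exported in the previous step and standard oc wait semantics:

# Point oc at the kubeconfig copied out of the microshift container
export KUBECONFIG="$PWD/tests/kubeconfig"

# Wait for the single microshift node to report Ready instead of sleeping blindly
oc wait --for=condition=Ready node --all --timeout=300s

# Quick sanity check of the core pods
oc get pods -A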
@periklis (Contributor):

Can you elaborate on whether these steps can or cannot simply be replaced by using microshift's GitHub action? https://github.com/marketplace/actions/microshift-openshift-cluster

@vprashar2929 (Contributor, Author):

I tried using the GitHub action to deploy MicroShift, but there seem to be some issues with Ubuntu, and as per this open issue I decided to go with the Docker deployment.

@periklis (Contributor):

Is it mandatory to use Ubuntu then and not simply the default OS assumed by the microshift action?

@vprashar2929 (Contributor, Author):

Currently MicroShift only works on the RHEL family of operating systems; it doesn't have multi-OS support. Also, GitHub Actions supports Ubuntu as its only Linux distribution as of now 😅

@periklis (Contributor):

I did a short read-up on microshift and I believe it is not worth the effort to use it as the base platform for our RHOBS CI. It is far away from supporting non-RHEL OSes, as you say, and it will stay too limited from a resource perspective (by design, AFAIU).

For our CI I suggest using kind instead; it is robust and its main purpose is supporting the development workflow. It is super easy to spin up with a GitHub action (see https://github.com/marketplace?type=actions&query=kind+).

For example, we use it for building our upstream operators (see the Loki Operator quickstart.sh or the Loki Operator GH Action).
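
For illustration, a minimal shell sketch of spinning up a disposable kind cluster in CI; the cluster name below is a placeholder:

# Create a single-node kind cluster and wait until the control plane is ready
kind create cluster --name rhobs-ci --wait 120s

# kind registers its kubeconfig context as kind-<cluster-name>
kubectl cluster-info --context kind-rhobs-ci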

@vprashar2929 (Contributor, Author):

I did some exploration around using kind as well. The manifests currently generated are OpenShift-based templates, and some rely on OpenShift-specific resources/features which are not available in a standard Kubernetes distribution. So deploying those templates straight away would not be easy; some modifications would be needed.

@periklis (Contributor):

Can you list the manifests please? The only show-stopper for using kind is oc process, but even this can be used as oc process --local -f without needing the OpenShift API server. If I assume right, you mean manifests like ServiceMonitor, right? If yes, we can still use kind by simply applying the CRDs from the prometheus-operator (see observatorium e2e.sh).
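
For reference, a sketch of the client-side templating mentioned above; the template path and parameter are placeholders rather than the repo's actual values:

# --local renders the Template entirely on the client, so a plain Kubernetes
# cluster such as kind can consume the output without the OpenShift API server
oc process --local -f path/to/observatorium-metrics-template.yaml \
    -p NAMESPACE=observatorium-metrics | oc apply -f -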

@vprashar2929 (Contributor, Author):

CRDs are no problem. The manifests deploy fine after the required CRDs are applied. I am not able to recall the exact issue, but there was one instance with deploying the compact-tls secret, which is a service serving certificate secret we use when deploying thanos-compact. AFAIK there is no straightforward way to achieve this when deploying on plain Kubernetes.

My only intention in using MicroShift was to spend less time on cluster setup and to deploy the generated manifests in an OCP-like environment, since ultimately these manifests will be used to deploy RHOBS in the stage/prod environments, and it gives us something closer to actual-environment confidence.
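
For context, the OpenShift mechanism in question is the service-ca operator: annotating a Service makes OpenShift generate the serving certificate secret automatically. On plain Kubernetes one possible workaround is to generate a short-lived self-signed certificate and create the secret by hand. A sketch under those assumptions; the service name and namespace are illustrative placeholders:

# On OpenShift, this annotation asks service-ca to create the compact-tls secret:
#   oc annotate service observatorium-thanos-compact -n observatorium-metrics \
#     service.beta.openshift.io/serving-cert-secret-name=compact-tls

# On kind/plain Kubernetes, a self-signed stand-in could be created manually:
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout tls.key -out tls.crt \
    -subj "/CN=observatorium-thanos-compact.observatorium-metrics.svc"
oc create secret tls compact-tls -n observatorium-metrics \
    --cert=tls.crt --key=tls.key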

        run: docker exec -i microshift cat /var/lib/microshift/resources/kubeadmin/kubeconfig > tests/kubeconfig
      - name: List the kubeconfig file
        run: ls tests
      - name: Deploy pre-requisites
@periklis (Contributor):

Any way to offload this inline YAML into static files?

@vprashar2929 (Contributor, Author):

Yes, I will definitely move them into static files.
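
As a small illustration, the workflow step could then apply the checked-in files directly; the directory path below is a placeholder:

# Apply all pre-requisite manifests kept as static files in the repo
oc apply -f tests/manifests/prerequisites/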

Comment on lines +13 to +35
check_pod_status(){
    podname=$1
    namespace=$2
    retry=3
    while [ $retry -ne 0 ]
    do
        containerStatuses=$(oc get $podname -n $namespace -o jsonpath='{.status.containerStatuses[*].state}')
        if [[ $containerStatuses == *'waiting'* || -z $containerStatuses ]];
        then
            log_error "Output: $containerStatuses"
            log_error "Retrying again..."
        else
            log_info "Status of $podname is healthy inside $namespace namespace"
            return
        fi
        sleep 30
        ((retry--))
    done
    log_error "Retry exhausted!!!"
    teardown
    exit 1
}
@periklis (Contributor):

Checking rollouts is better served by the oc rollout status deployment... and oc rollout status statefulset commands. This eliminates maintaining many lines of bash script.
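
A sketch of that approach; the resource names and namespace below are illustrative, not the actual manifest names:

# Block until a Deployment and a StatefulSet finish rolling out, with a timeout
oc rollout status deployment/observatorium-thanos-query -n observatorium-metrics --timeout=300s
oc rollout status statefulset/observatorium-thanos-compact -n observatorium-metrics --timeout=300s

# oc wait is an alternative for per-pod readiness conditions
oc wait --for=condition=Ready pods -l app.kubernetes.io/name=thanos-query \
    -n observatorium-metrics --timeout=300s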

@vprashar2929 (Contributor, Author):

That's an excellent point. I never tried checking status with oc rollout status deployment or oc rollout status statefulset. I will definitely try these and change the script accordingly.

Comment on lines +68 to +80
destroy(){
    depname=$1
    namespace=$2
    log_info "Destroying $depname resources inside $namespace namespace"
    if [ $depname == 'memcached' ];
    then
        return
    fi
    oc delete statefulsets -n $namespace -l app.kubernetes.io/name=$depname 1> /dev/null
    oc delete deployment -n $namespace -l app.kubernetes.io/name=$depname 1> /dev/null
    oc delete pvc -n $namespace --all=true 1> /dev/null
    sleep 30
}
@periklis (Contributor):

There is no need to destroy each statefulset/deployment individually; you can reuse the template for this:

oc process -f observatorium-metrics.template.yaml | oc delete -f -

@vprashar2929 (Contributor, Author):

My thought behind destroying each statefulset/deployment is the resource limit on GitHub-hosted runners, i.e. 2-core CPU, 7 GB of RAM, and 14 GB of SSD. Because of this we cannot deploy all components of RHOBS at once. That's why I had to deploy each component one by one, verify it, and then destroy it so that sufficient room is available for the next component to deploy.

@periklis (Contributor):

I agree, insufficient resources are always a PITA. However, spinning up the components only partially does not account for the fact that the whole system can be used for a test. The current approach only tests that they start, which is valuable as such but very limited. IMHO we should aim for a CI environment that allows us to spin up the whole system (with super-limited CPU/mem maybe, as we do in observatorium e2e on CircleCI) and use observatorium/up to do at minimum blackbox testing (send metrics/logs in, read metrics/logs out).
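
For a rough idea of that blackbox check, a sketch using observatorium/up is below; the endpoint URLs and tenant are placeholders, and the flag names are recalled from memory of the up tool, so treat them as assumptions to verify against its README:

# Write a test series into the Observatorium API and read it back out
# (endpoints, tenant, and flags below are assumptions, not verified values)
up \
    --endpoint-type=metrics \
    --endpoint-write=http://observatorium-api.observatorium.svc:8080/api/metrics/v1/test/api/v1/receive \
    --endpoint-read=http://observatorium-api.observatorium.svc:8080/api/metrics/v1/test \
    --period=1s \
    --duration=2m \
    --name=up_ci_blackbox_check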

@periklis (Contributor):

In addition, if we cannot get more power from GitHub Actions, then piggybacking on observatorium's e2e test setup on CircleCI may be a good pivot.

@vprashar2929 (Contributor, Author):

Absolutely, the current approach only tests that pods spin up in a healthy state on manifest changes and provides quick validation against those changes, so that we don't get any surprises once they are deployed in the stage environment.
Initially I also thought there should be some kind of e2e test touching every component, but due to GitHub Actions limitations I scrapped that.
I think the runtest inside the post-deploy job does what you are suggesting, i.e. using observatorium/up to send and read metrics/logs.
If we wish to integrate runtest-like checks in CI, that can also be done, but for that, as you correctly mentioned, we may need to move away from GitHub Actions due to its limitations and explore CircleCI.
Let's see what the team's thoughts are on moving away from GitHub Actions.

@periklis (Contributor):

IMHO we should not limit our CI to what the post-deploy jobs do and don't do. The post-deploy jobs currently run tests that should have been part of the CI in the first place. The actual purpose of the post-deploy jobs on app-interface is to guard auto-promotion of changes from stage to production, and a simple blackbox test (as currently implemented) is not sufficient for that. Thus it is not a limitation to have this blackbox test in our CI here on GitHub or CircleCI.

To pave the road here, I suggest:

  1. Pull simple blackbox tests (like the one with up) into our repo's CI here. This ensures that we have some confidence from automated testing of our pull requests before they land. Of course this means working within the limits of our CI (e.g. microshift is not a fully-capable AppSRE-managed cluster). However, it will give some notion that changes won't break essential steps (e.g. oc process missing params, oc apply missing resources, wrong configuration breaking up's in/out testing).
  2. In a follow-up step we need to define what type of tests really guard the promotion from stage into production and implement them in the post-deploy jobs (I would keep the up test there too, to ensure that in/out testing works on the target cluster platform as well), e.g.
    a. Canary-testing that requests do not break our SLOs (e.g. availability, latency) using avalanche or loki-canary.
    b. Canary-testing that all tenants still have access to their metrics/logs.

@vprashar2929 (Contributor, Author):

(Quoting @periklis's review comment above.)

Thanks @periklis 😊 for taking the time to go through this. Your suggestions are insightful and top notch 🙇🏼‍♂️. I hadn't considered extending it for other teams; I agree with you that it will make the CI super easy to use.
The links above are very helpful and I will use them as a reference.

@vprashar2929 vprashar2929 marked this pull request as draft March 9, 2023 06:49
@douglascamata (Contributor):

@vprashar2929 is this still valid now that #421 is merged? If not, I think we should close it.

@vprashar2929 (Contributor, Author):

Yes, we can close this PR.
