Implement CI for RHOBS #418

Closed · wants to merge 3 commits

Conversation

@vprashar2929 (Contributor):

This PR implements CI for RHOBS.
The workflow test.yaml performs the following tasks:

  1. Install Docker on the Ubuntu GitHub runner.
  2. Download the oc CLI binary.
  3. Deploy the MicroShift Docker container.
  4. Deploy some of the dependencies.
  5. Run the test script in tests/ci_test.sh.

The ci_test.sh script primarily deploys each RHOBS resource one at a time on the MicroShift cluster and then runs a pod health check to verify that the pod is in a running state after the required manifests are applied. This is related to RHOBS-552. Currently it covers only basic validation; in-depth validation can be covered under the post-deploy job.

This is still in progress, so any comments, suggestions, or contributions are welcome 😄
If you want to test with a breakable commit, change the following syntax inside .github/workflows/test.yaml and push your breakable commit/changes to verify:

# on:
#  workflow_run:
#    workflows: ["CI build"]
#    types:
#    - completed

# For debugging the pipeline
on: [push]

Note: Currently the test pipeline takes approximately 30 minutes to complete. I will try to reduce this if possible.

@periklis (Contributor) left a comment:

Awesome ❤️ for doing this long-standing work! I really appreciate all the effort put into this. I have a couple of improvement comments to make it sustainable and easy to extend for our other teams (e.g. my folks and myself from the log storage team and the tracing folks).

  • Let's try to squeeze as much as possible out of existing GitHub actions for setting up microshift, docker, and the oc binary. Using off-the-shelf actions improves caching and, in turn, build times.
  • Much of ci_test.sh can be simplified with existing commands such as oc rollout or the older oc wait. Consider reusing as much as possible from: observatorium e2e.sh
  • Since we start a huge system in a cluster, it would be super helpful to have a must-gather in case of erroneous build runs, e.g. have a look at the observatorium must-gather func (a rough sketch of such a helper follows below this list).
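
A minimal sketch of such a must-gather helper, assuming plain oc commands and an artifacts directory provided by the CI job; the function name, namespaces, and label selector below are illustrative only:

# Collect basic diagnostics before tearing the cluster down on a failed run.
# Namespaces and labels listed here are placeholders for the ones the CI actually creates.
must_gather() {
    artifacts_dir=${1:-/tmp/artifacts}
    mkdir -p "$artifacts_dir"
    oc get pods --all-namespaces -o wide > "$artifacts_dir/pods.txt"
    oc get events --all-namespaces --sort-by=.lastTimestamp > "$artifacts_dir/events.txt"
    for ns in observatorium observatorium-metrics minio dex; do
        oc describe pods -n "$ns" > "$artifacts_dir/describe-$ns.txt" 2>/dev/null || true
        oc logs -n "$ns" --all-containers --prefix -l app.kubernetes.io/part-of=observatorium \
            > "$artifacts_dir/logs-$ns.txt" 2>/dev/null || true
    done
}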

Comment on lines +25 to +66
      - name: Print CWD
        run: echo "$PWD"
      - name: Configure env
        run: |
          sudo apt-get update
          sudo apt-get install \
            ca-certificates \
            curl \
            wget \
            gnupg \
            lsb-release
          sudo mkdir -m 0755 -p /etc/apt/keyrings
          curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
          echo \
            "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
            $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
          sudo chmod a+r /etc/apt/keyrings/docker.gpg
          sudo apt-get update
      - name: Install Docker
        run: |
          sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
      - name: Check docker is running
        run: |
          sudo systemctl status docker
      - name: Get Docker info
        run: |
          docker info
      - name: Get OC CLI Binary
        run: |
          wget https://github.com/okd-project/okd/releases/download/4.12.0-0.okd-2023-02-18-033438/openshift-client-linux-4.12.0-0.okd-2023-02-18-033438.tar.gz
          tar xzvf openshift-client-linux-4.12.0-0.okd-2023-02-18-033438.tar.gz
          sudo mv oc kubectl /usr/local/bin
      - name: Run OC Binary
        run: oc --help
      - name: Spin up microshift container
        run: |
          docker run -d --name microshift --privileged -v microshift-data:/var/lib -p 6443:6443 -p 80:80 -p 443:443 quay.io/microshift/microshift-aio:latest
          sleep 60s
      - name: Export the kubeconfig
        run: docker exec -i microshift cat /var/lib/microshift/resources/kubeadmin/kubeconfig > tests/kubeconfig
      - name: List the kubeconfig file
        run: ls tests
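
As an aside on the fixed sleep above, a sketch of a more deterministic readiness check, assuming the kubeconfig exported in the previous step and standard oc wait semantics:

# Point oc at the kubeconfig copied out of the microshift container
export KUBECONFIG="$PWD/tests/kubeconfig"

# Wait for the single microshift node to report Ready instead of sleeping blindly
oc wait --for=condition=Ready node --all --timeout=300s

# Quick sanity check of the core pods
oc get pods -A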
@periklis (Contributor):

Can you elaborate on whether these steps can or cannot simply be replaced by using microshift's GitHub action? https://github.com/marketplace/actions/microshift-openshift-cluster

@vprashar2929 (Contributor, Author):

I tried using the GitHub action to deploy MicroShift, but there seem to be some issues with Ubuntu, and as per this open issue I decided to go with the Docker deployment.

@periklis (Contributor):

Is it mandatory to use Ubuntu then and not simply the default OS assumed by the microshift action?

@vprashar2929 (Contributor, Author):

Currently MicroShift only works on the RHEL family of operating systems; it doesn't have multi-OS support. Also, GitHub Actions supports Ubuntu as its only Linux distribution as of now 😅

@periklis (Contributor):

I did a short read-up on microshift and I believe it is not worth the effort to use it as the base platform for our RHOBS CI. It is far away from supporting non-RHEL OSes, as you say, and it will stay too limited from a resource perspective (by design, AFAIU).

For our CI I suggest using kind instead; it is robust and its main purpose is supporting the development workflow. It is super easy to spin up with a GitHub action (see https://github.com/marketplace?type=actions&query=kind+).

For example, we use it for building our upstream operators (see the Loki Operator quickstart.sh or the Loki Operator GH Action).
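
For illustration, a minimal shell sketch of spinning up a disposable kind cluster in CI; the cluster name below is a placeholder:

# Create a single-node kind cluster and wait until the control plane is ready
kind create cluster --name rhobs-ci --wait 120s

# kind registers its kubeconfig context as kind-<cluster-name>
kubectl cluster-info --context kind-rhobs-ci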

@vprashar2929 (Contributor, Author):

I did some exploration around using kind as well. The manifests currently generated are OpenShift-based templates, and some rely on OpenShift-specific resources/features which are not available in a standard Kubernetes distribution. So deploying those templates straight away would not be easy; some modifications would be needed.

@periklis (Contributor):

Can you list the manifests please? The only show-stopper for using kind is oc process, but even this can be used as oc process --local -f without needing the OpenShift API server. If I assume right, you mean manifests like ServiceMonitor, right? If yes, we can still use kind by simply applying the CRDs from the prometheus-operator (see observatorium e2e.sh).
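
For reference, a sketch of the client-side templating mentioned above; the template path and parameter are placeholders rather than the repo's actual values:

# --local renders the Template entirely on the client, so a plain Kubernetes
# cluster such as kind can consume the output without the OpenShift API server
oc process --local -f path/to/observatorium-metrics-template.yaml \
    -p NAMESPACE=observatorium-metrics | oc apply -f -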

@vprashar2929 (Contributor, Author):

CRDs are no problem. The manifests deploy fine after the required CRDs are applied. I am not able to recall the exact issue, but there was one instance with deploying the compact-tls secret, which is a service serving certificate secret we use when deploying thanos-compact. AFAIK there is no straightforward way to achieve this when deploying on plain Kubernetes.

My only intention in using MicroShift was to spend less time on cluster setup and to deploy the generated manifests in an OCP-like environment, since ultimately these manifests will be used to deploy RHOBS in the stage/prod environments, and it gives us something closer to actual-environment confidence.
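
For context, the OpenShift mechanism in question is the service-ca operator: annotating a Service makes OpenShift generate the serving certificate secret automatically. On plain Kubernetes one possible workaround is to generate a short-lived self-signed certificate and create the secret by hand. A sketch under those assumptions; the service name and namespace are illustrative placeholders:

# On OpenShift, this annotation asks service-ca to create the compact-tls secret:
#   oc annotate service observatorium-thanos-compact -n observatorium-metrics \
#     service.beta.openshift.io/serving-cert-secret-name=compact-tls

# On kind/plain Kubernetes, a self-signed stand-in could be created manually:
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout tls.key -out tls.crt \
    -subj "/CN=observatorium-thanos-compact.observatorium-metrics.svc"
oc create secret tls compact-tls -n observatorium-metrics \
    --cert=tls.crt --key=tls.key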

        run: docker exec -i microshift cat /var/lib/microshift/resources/kubeadmin/kubeconfig > tests/kubeconfig
      - name: List the kubeconfig file
        run: ls tests
      - name: Deploy pre-requisites
@periklis (Contributor):

Any way to offload this inline YAML into static files?

@vprashar2929 (Contributor, Author):

Yes, I will definitely move them into static files.
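
As a small illustration, the workflow step could then apply the checked-in files directly; the directory path below is a placeholder:

# Apply all pre-requisite manifests kept as static files in the repo
oc apply -f tests/manifests/prerequisites/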

Comment on lines +13 to +35
check_pod_status(){
    podname=$1
    namespace=$2
    retry=3
    while [ $retry -ne 0 ]
    do
        containerStatuses=$(oc get $podname -n $namespace -o jsonpath='{.status.containerStatuses[*].state}')
        if [[ $containerStatuses == *'waiting'* || -z $containerStatuses ]];
        then
            log_error "Output: $containerStatuses"
            log_error "Retrying again..."
        else
            log_info "Status of $podname is healthy inside $namespace namespace"
            return
        fi
        sleep 30
        ((retry--))
    done
    log_error "Retry exhausted!!!"
    teardown
    exit 1
}
@periklis (Contributor):

Checking rollouts is better served by the oc rollout status deployment... and oc rollout status statefulset commands. This eliminates maintaining many lines of bash script.
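
A sketch of that approach; the resource names and namespace below are illustrative, not the actual manifest names:

# Block until a Deployment and a StatefulSet finish rolling out, with a timeout
oc rollout status deployment/observatorium-thanos-query -n observatorium-metrics --timeout=300s
oc rollout status statefulset/observatorium-thanos-compact -n observatorium-metrics --timeout=300s

# oc wait is an alternative for per-pod readiness conditions
oc wait --for=condition=Ready pods -l app.kubernetes.io/name=thanos-query \
    -n observatorium-metrics --timeout=300s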

@vprashar2929 (Contributor, Author):

That's an excellent point. I never tried checking status with oc rollout status deployment or oc rollout status statefulset. I will definitely try these and change the script accordingly.

Comment on lines +68 to +80
destroy(){
    depname=$1
    namespace=$2
    log_info "Destroying $depname resources inside $namespace namespace"
    if [ $depname == 'memcached' ];
    then
        return
    fi
    oc delete statefulsets -n $namespace -l app.kubernetes.io/name=$depname 1> /dev/null
    oc delete deployment -n $namespace -l app.kubernetes.io/name=$depname 1> /dev/null
    oc delete pvc -n $namespace --all=true 1> /dev/null
    sleep 30
}
@periklis (Contributor):

There is no need to destroy each statefulset/deployment individually; you can reuse the template for this:

oc process -f observatorium-metrics.template.yaml | oc delete -f -

@vprashar2929 (Contributor, Author):

My thought behind destroying each statefulset/deployment is the resource limit on GitHub-hosted runners, i.e. 2-core CPU, 7 GB of RAM, and 14 GB of SSD. Because of this we cannot deploy all components of RHOBS at once. That's why I had to deploy each component one by one, verify it, and then destroy it so that sufficient room is available for the next component to deploy.

@periklis (Contributor):

I agree, insufficient resources are always a PITA. However, spinning up the components only partially does not account for the fact that the whole system can be used for a test. The current approach only tests that they start, which is valuable as such but very limited. IMHO we should aim for a CI environment that allows us to spin up the whole system (with super-limited CPU/mem maybe, as we do in observatorium e2e on CircleCI) and use observatorium/up to do at minimum blackbox testing (send metrics/logs in, read metrics/logs out).
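
For a rough idea of that blackbox check, a sketch using observatorium/up is below; the endpoint URLs and tenant are placeholders, and the flag names are recalled from memory of the up tool, so treat them as assumptions to verify against its README:

# Write a test series into the Observatorium API and read it back out
# (endpoints, tenant, and flags below are assumptions, not verified values)
up \
    --endpoint-type=metrics \
    --endpoint-write=http://observatorium-api.observatorium.svc:8080/api/metrics/v1/test/api/v1/receive \
    --endpoint-read=http://observatorium-api.observatorium.svc:8080/api/metrics/v1/test \
    --period=1s \
    --duration=2m \
    --name=up_ci_blackbox_check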

@periklis (Contributor):

In addition, if we cannot get more power from GitHub Actions, then piggybacking on observatorium's e2e test setup on CircleCI may be a good pivot.

@vprashar2929 (Contributor, Author):

Absolutely, the current approach only tests that pods spin up in a healthy state on manifest changes and provides quick validation against those changes, so that we don't get any surprises once they are deployed in the stage environment.
Initially I also thought there should be some kind of e2e test touching every component, but due to GitHub Actions limitations I scrapped that.
I think the runtest inside the post-deploy job does what you are suggesting, i.e. using observatorium/up to send and read metrics/logs.
If we wish to integrate runtest-like checks in CI, that can also be done, but for that, as you correctly mentioned, we may need to move away from GitHub Actions due to its limitations and explore CircleCI.
Let's see what the team's thoughts are on moving away from GitHub Actions.

@periklis (Contributor):

IMHO we should not limit our CI to what the post-deploy jobs do and don't do. The post-deploy jobs currently run tests that should have been part of the CI in the first place. The actual purpose of the post-deploy jobs on app-interface is to guard auto-promotion of changes from stage to production, and a simple blackbox test (as currently implemented) is not sufficient for that. Thus it is not a limitation to have this blackbox test in our CI here on GitHub or CircleCI.

To pave the road here, I suggest:

  1. Pull simple blackbox tests (like the one with up) into our repo's CI here. This ensures that we have some confidence from automated testing of our pull requests before they land. Of course this means working within the limits of our CI (e.g. microshift is not a fully-capable AppSRE-managed cluster). However, it will give some notion that changes won't break essential steps (e.g. oc process missing params, oc apply missing resources, wrong configuration breaking up's in/out testing).
  2. In a follow-up step we need to define what type of tests really guard the promotion from stage into production and implement them in the post-deploy jobs (I would keep the up test there too, to ensure that in/out testing works on the target cluster platform as well), e.g.
    a. Canary-testing that requests do not break our SLOs (e.g. availability, latency) using avalanche or loki-canary.
    b. Canary-testing that all tenants still have access to their metrics/logs.

@vprashar2929 (Contributor, Author):

(Quoting @periklis's review comment above.)

Thanks @periklis 😊 for taking the time to go through this. Your suggestions are insightful and top notch 🙇🏼‍♂️. I hadn't considered extending it for other teams; I agree with you that it will make the CI super easy to use.
The links above are very helpful and I will use them as a reference.

@vprashar2929 vprashar2929 marked this pull request as draft March 9, 2023 06:49
@douglascamata (Contributor):

@vprashar2929 is this still valid now that #421 is merged? If not, I think we should close it.

@vprashar2929 (Contributor, Author):

Yes, we can close this PR.
