CI: BATS: Make k8s/helm-install-rancher pass on macOS #7069

Draft · wants to merge 9 commits into main from bats/k8s/helm-install-rancher/pull-image-separately

Conversation

mook-as (Contributor) commented on Jun 18, 2024

This changes the k8s/helm-install-rancher BATS test to stop waiting via helm and instead examine the deployed pods for expected log lines (and otherwise wait for objects manually). This lets the test pass in CI, because the macOS CI hardware appears to be slower.

Note that Windows CI is currently failing because WSL setup is broken; I'm looking into that separately.
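
For context, assert_pod_log_line (used throughout the diff below) is a helper defined elsewhere in the BATS suite; a minimal sketch of what it could look like, assuming the standard bats-assert primitives and plain kubectl (this implementation is hypothetical, the real helper may differ):

assert_pod_log_line() {
    local namespace=$1 app=$2
    shift 2
    # Grab all log lines from pods matching the app label, then look for
    # the expected substring; under try this is retried until it appears.
    run kubectl logs --namespace "$namespace" --selector "app=$app" --tail=-1
    assert_success || return
    assert_output --partial "$*"
}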

mook-as added this to the 1.15 milestone on Jun 18, 2024
mook-as force-pushed the bats/k8s/helm-install-rancher/pull-image-separately branch from 4bee8ad to 7a767a1 on June 18, 2024 at 20:29
jandubois (Member) left a comment:

Please see individual comments

assert_true() {
    run --separate-stderr "$@"
    assert_success || return
    is_true "$output" || return
}
jandubois (Member):

Is this maybe a bit too forgiving? It will treat anything that is not one of '', '0', 'no', 'false' as true. So if for some reason the status field could say offline, it would be treated as true.
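
For illustration, a stricter variant could allow-list truthy values instead of deny-listing falsy ones (a hypothetical sketch; assert_true_strict and its accepted spellings are assumptions, not existing helpers):

assert_true_strict() {
    run --separate-stderr "$@"
    assert_success || return
    # Accept only known-truthy spellings; anything else (e.g. "offline") fails.
    case "$(echo "$output" | tr '[:upper:]' '[:lower:]')" in
    1 | true | yes) return 0 ;;
    *)
        echo "expected a truthy value, got: $output" >&2
        return 1
        ;;
    esac
}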

Comment on lines 127 to 131
if using_docker; then
    try docker pull --quiet "rancher/rancher:v$rancher_chart_version"
else
    try nerdctl pull --namespace k8s.io --quiet "rancher/rancher:v$rancher_chart_version"
fi
jandubois (Member):

Suggested change
-    if using_docker; then
-        try docker pull --quiet "rancher/rancher:v$rancher_chart_version"
-    else
-        try nerdctl pull --namespace k8s.io --quiet "rancher/rancher:v$rancher_chart_version"
-    fi
+    local CONTAINERD_NAMESPACE=k8s.io
+    try ctrctl pull --quiet "rancher/rancher:v$rancher_chart_version"

Also technically --namespace should come before the pull subcommand because it is a global option, but nowadays nerdctl deals with it correctly even when specified later.
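
For reference, both placements work with current nerdctl, and the environment-variable route is what the ctrctl suggestion above relies on (a sketch; rancher_chart_version comes from the surrounding test):

# Canonical placement: the global flag goes before the subcommand.
nerdctl --namespace k8s.io pull --quiet "rancher/rancher:v$rancher_chart_version"

# Equivalent via the CONTAINERD_NAMESPACE environment variable.
CONTAINERD_NAMESPACE=k8s.io nerdctl pull --quiet "rancher/rancher:v$rancher_chart_version"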

Comment on lines 135 to 140
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher Listening on :443
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher Starting catalog controller
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher Watching metadata for rke-machine-config.cattle.io/v1
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher 'Creating clusterRole for roleTemplate Cluster Owner (cluster-owner).'
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher Rancher startup complete
try --max 120 --delay 10 assert_pod_log_line cattle-system rancher Created machine for node
jandubois (Member):

I was a little shocked that the total timeouts add up to 70 minutes (five steps at 60 retries × 10 s = 10 minutes each, plus 120 retries × 10 s = 20 minutes for the last one), but I guess there must be progress at least every 10 minutes.

Can the final step really take another 20 minutes???

Anyway, do we need a || return on each try statement, for consistency?

I guess not, if we follow the rule that wait_for_* functions are never called via try. Maybe we need a linter rule for that (not in this PR).

mook-as (Contributor, author):

Yeah, in practice we never need that much. I'll go tone down the max retries, based on what actually happened in the runs.

bats/tests/k8s/helm-install-rancher.bats (resolved thread)

local host
host=$(traefik_hostname) || return

comment "Installing rancher $rancher_chart_version"
# The helm install can take a long time, especially on CI
jandubois (Member):

I think this comment should spell out that we intentionally don't use --wait and --timeout options for helm, but instead manually check progress along the way because things are so slow in CI.
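
For example, something along these lines (a sketch of the suggested comment wording, not the actual diff):

# We intentionally do not pass --wait/--timeout to helm here: on slow CI
# machines (especially macOS) helm's own timeout can fire before the
# deployment settles. Instead, we watch the pod logs below for known
# progress markers until Rancher is actually ready.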

bats/tests/k8s/helm-install-rancher.bats (resolved thread)
Comment on lines 198 to 200
# Unfortunately, the Rancher pod could get restarted, so we need to put this in a loop :(
try wait_for_rancher_pod
jandubois (Member):

I don't understand the comment. The try will succeed when the pod is running. So how would it deal with the pod restarting after try has succeeded once?

mook-as (Contributor, author):

The pod might restart in the middle of one of the steps inside wait_for_rancher_pod; if that happens, the later checks are likely to fail before the timeout. I'll try to reword this.
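
In other words, the outer try treats the whole wait_for_rancher_pod sequence as a single attempt, so a mid-sequence restart just fails that attempt and the checks start over from the top. A rough sketch of the shape (the step bodies here are illustrative, not the actual function):

wait_for_rancher_pod() {
    # Each step fails fast if the pod restarted and its log was reset...
    assert_pod_log_line cattle-system rancher "Listening on :443" || return
    assert_pod_log_line cattle-system rancher "Rancher startup complete" || return
}

# ...and the outer try then re-runs the whole sequence on any failure.
try wait_for_rancher_pod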

Comment on lines 219 to 226
# The rancher pod sometimes falls over on its own; retry in a loop
local i
for i in {1..10}; do
    sleep 1
    try --max 60 --delay 10 assert_kube_deployment_available --namespace cattle-system rancher
done
jandubois (Member):

I don't get the purpose of the loop. It checks that the deployment is available at least 10 times, but it can fall over up to 59 times between each check. So what does this actually prove?

mook-as (Contributor, author):

This checks that the deployment has succeeded ten times (i.e. that it has stopped flapping).

jandubois (Member):

That was my point: how does that show it has stopped flapping? It just means you have observed it 10 times when it was running, but that doesn't mean it has stopped flapping, because you continue to retry until it is up again.

The app can be down for almost 100 minutes: as long as it is temporarily up for at least 10 seconds within every 10-minute interval, the test will pass even though the app was down for 98 of those 100 minutes.
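
A check that would actually demonstrate the flapping has stopped needs consecutive successes, resetting the streak on any failure. A hypothetical sketch (assert_stable_deployment is not an existing helper, and the kubectl probe and limits are assumptions):

assert_stable_deployment() {
    local successes=0 attempts=0
    while ((successes < 10)); do
        ((++attempts <= 120)) || return 1 # give up after ~20 minutes
        if kubectl get deployment rancher --namespace cattle-system \
            --output jsonpath='{.status.availableReplicas}' 2>/dev/null |
            grep --quiet '^[1-9]'; then
            ((++successes)) # require consecutive good probes
        else
            successes=0 # any failure resets the streak
        fi
        sleep 10
    done
}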

bats/tests/k8s/helm-install-rancher.bats (resolved thread)
Commit messages from this branch:

- Use long arguments for commands where available.
- Tell shfmt that we're linting for bats (for *.bash).

Signed-off-by: Mark Yen <[email protected]>
This test is prone to failing on macOS CI, possibly because the runners are
somewhat slower.  Try to improve this by _not_ waiting for the helm chart
deployment to finish, but instead manually checking for key log lines in
the containers until they are actually ready.

This also does a couple other things to help this test pass:
- Pre-pull the rancher image to ensure the image pulling doesn't make the
  machine busier than necessary.
- Update the arguments for the cert-manager chart (removing deprecated
  usage).

Signed-off-by: Mark Yen <[email protected]>
In case we got called right after a factory reset.

Signed-off-by: Mark Yen <[email protected]>
By default it is set to 3, which is unneeded in CI.

Signed-off-by: Mark Yen <[email protected]>
Since we'll be doing factory resets before each Kubernetes version (or in
the next test file, if it's the last version), there's no need to do an
uninstall that could take significant time.

Signed-off-by: Mark Yen <[email protected]>
This test needs a lot of RAM (to run Rancher Manager); disable ramdisk
to avoid hitting swap.

Signed-off-by: Mark Yen <[email protected]>
This is a Kubernetes deployment that is spawned as part of Rancher Manager.

Signed-off-by: Mark Yen <[email protected]>
Signed-off-by: Mark Yen <[email protected]>
mook-as force-pushed the bats/k8s/helm-install-rancher/pull-image-separately branch from ff7056e to d59616f on June 20, 2024 at 00:23
mook-as (Contributor, author) left a comment:

Comments (hopefully) addressed.

Windows is still failing for unrelated reasons:

kube-system svclb-traefik-jpmk2 0/2 CrashLoopBackOff 10 (2m21s ago) 5m24s

mook-as requested a review from jandubois on June 20, 2024 at 00:24
jandubois (Member) left a comment:

Looks like we need more || return clauses because the wait_for* functions are invoked by try.

I would prefer to not call them wait_for* if they are to be used like that, but we can rename them if/when we have a lint rule that enforces this.

I also still think "the loop" does not do what the comment says, and should either be modified or removed.

Comment on lines +132 to +133
try assert_pod_log_line cattle-system rancher Listening on :443
try assert_pod_log_line cattle-system rancher Starting catalog controller
jandubois (Member):

Since wait_for_rancher_pod is called via try (which calls it via run), we need to add || return to each step that can fail.

I also think the lines would be easier to read if the expected string was quoted:

Suggested change
-    try assert_pod_log_line cattle-system rancher Listening on :443
-    try assert_pod_log_line cattle-system rancher Starting catalog controller
+    try assert_pod_log_line cattle-system rancher "Listening on :443" || return
+    try assert_pod_log_line cattle-system rancher "Starting catalog controller" || return

But given that so many asserts all target the same namespace and app, I wonder if they shouldn't be selected by global variables as well:

Suggested change
-    try assert_pod_log_line cattle-system rancher Listening on :443
-    try assert_pod_log_line cattle-system rancher Starting catalog controller
+    local NAMESPACE=cattle-system
+    local APP=rancher
+    try assert_pod_log_line "Listening on :443" || return
+    try assert_pod_log_line "Starting catalog controller" || return

--set "extraArgs[0]=--enable-certificate-owner-ref=true" \
--create-namespace
try assert_not_empty_list helm list --namespace cert-manager --deployed --output json --selector name=cert-manager
jandubois (Member):

Add an empty line before?

@@ -122,24 +232,32 @@ verify_rancher() {
        skip_unless_host_ip
    fi

    # Get k3s logs if possible before things fail
jandubois (Member):

The comment doesn't match the following commands; they don't fetch logs; they just list deployments and pods. I assume to get the information into the BATS output.

This comment seems to belong on top of the next block instead. I guess it became confusing because I asked you to put in more newlines. 😄

mook-as (Contributor, author) commented on Jun 21, 2024:

This is still (sometimes) failing; I think it happens when the rancher pod restarts on a failure, which can lead to a CrashLoopBackOff that takes longer than two hours to resolve. We might need to disable this test on macOS completely instead :(

Sample run: https://github.com/mook-as/rd/actions/runs/9606696487/job/26527750610

mook-as force-pushed the bats/k8s/helm-install-rancher/pull-image-separately branch from f60d7b5 to d59616f on June 28, 2024 at 21:50
jandubois self-assigned this on Jul 4, 2024
jandubois marked this pull request as draft on July 11, 2024 at 17:04
jandubois modified the milestones: 1.15 → 1.16 on Jul 19, 2024
gunamata modified the milestones: 1.16 → 1.17 on Sep 5, 2024