CI: BATS: Make k8s/helm-install-rancher pass on macOS #7069
base: main
Conversation
Force-pushed from 4bee8ad to 7a767a1.
Please see individual comments
```bash
assert_true() {
    run --separate-stderr "$@"
    assert_success || return
    is_true "$output" || return
```
Is this maybe a bit too forgiving? It will treat anything that is not one of `''`, `'0'`, `'no'`, `'false'` as true. So if for some reason the status field could say `offline`, it would be treated as true.
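A stricter variant would accept only an explicit allow-list of truthy values, so that an unexpected status such as `offline` fails loudly. A minimal sketch (the name `is_true_strict` and the accepted values are assumptions, not code from this PR):

```bash
# Hypothetical stricter check: only an explicit allow-list counts as true,
# so unexpected values like "offline" fail instead of passing silently.
is_true_strict() {
    case "${1,,}" in # ${1,,} lowercases the argument (requires bash 4+)
    1 | true | yes | on) return 0 ;;
    *) return 1 ;;
    esac
}
```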
```bash
if using_docker; then
    try docker pull --quiet "rancher/rancher:v$rancher_chart_version"
else
    try nerdctl pull --namespace k8s.io --quiet "rancher/rancher:v$rancher_chart_version"
fi
```
Suggested change:

```diff
-if using_docker; then
-    try docker pull --quiet "rancher/rancher:v$rancher_chart_version"
-else
-    try nerdctl pull --namespace k8s.io --quiet "rancher/rancher:v$rancher_chart_version"
-fi
+local CONTAINERD_NAMESPACE=k8s.io
+try ctrctl pull --quiet "rancher/rancher:v$rancher_chart_version"
```
Also, technically `--namespace` should come before the `pull` subcommand because it is a global option, but nowadays `nerdctl` deals with it correctly even when it is specified later.
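For context, a `ctrctl`-style wrapper presumably dispatches to whichever engine is active, roughly along these lines (an illustrative sketch; the real helper may differ):

```bash
# Assumed dispatcher: forward to docker or nerdctl depending on the engine,
# honoring CONTAINERD_NAMESPACE in the nerdctl case.
ctrctl() {
    if using_docker; then
        docker "$@"
    else
        nerdctl --namespace "${CONTAINERD_NAMESPACE:-default}" "$@"
    fi
}
```

Because bash locals are dynamically scoped, the suggested `local CONTAINERD_NAMESPACE=k8s.io` would be visible inside `ctrctl` for the duration of the calling function.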
```bash
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher Listening on :443
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher Starting catalog controller
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher Watching metadata for rke-machine-config.cattle.io/v1
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher 'Creating clusterRole for roleTemplate Cluster Owner (cluster-owner).'
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher Rancher startup complete
try --max 120 --delay 10 assert_pod_log_line cattle-system rancher Created machine for node
```
I was a little shocked that the total timeouts add up to 70 minutes (five steps at 60 retries × 10 s = 10 minutes each, plus 120 × 10 s = 20 minutes for the last one), but I guess there must be progress at least every 10 minutes. Can the final step really take another 20 minutes???

Anyways, do we need a `|| return` on each `try` statement, for consistency? I guess not, if we follow the rule that `wait_for_*` functions are never called via `try`. Maybe we need a linter rule for that (not in this PR).
Yeah, in practice we never need that much. I'll tone down the max retries based on what actually happened in the runs.
```bash
local host
host=$(traefik_hostname) || return

comment "Installing rancher $rancher_chart_version"
# The helm install can take a long time, especially on CI
```
I think this comment should spell out that we intentionally don't use the `--wait` and `--timeout` options for `helm`, but instead manually check progress along the way, because things are so slow in CI.
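Something along these lines, perhaps (the chart arguments shown here are illustrative, not taken from this PR):

```bash
# Deliberately NOT passing --wait/--timeout: CI machines are slow enough
# that helm's own timeout can trip. Instead, poll pod logs and deployment
# status afterwards until Rancher is actually ready.
helm upgrade --install rancher rancher-latest/rancher \
    --namespace cattle-system \
    --set hostname="$host" \
    --set replicas=1
```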
```bash
# Unfortunately, the Rancher pod could get restarted, so we need to put this in a loop :(
try wait_for_rancher_pod
```
I don't understand the comment. The `try` will succeed when the pod is running. So how would it deal with the pod restarting after `try` has succeeded once?
The pod might restart in the middle of one of the steps inside `wait_for_rancher_pod`; if that happens, the later checks are likely to fail before the timeout. I'll try to reword this.
```bash
# The rancher pod sometimes falls over on its own; retry in a loop
local i
for i in {1..10}; do
    sleep 1
    try --max 60 --delay 10 assert_kube_deployment_available --namespace cattle-system rancher
done
```
I don't get the purpose of the loop. It checks that the deployment is available at least 10 times, but it can fall over up to 59 times between each check. So what does this actually prove?
This checks that the deployment has succeeded ten times (i.e. that it has stopped flapping).
That was my point: how does this show that it has stopped flapping? It just means you have observed it running 10 times, but that doesn't mean it has stopped flapping, because you continue to retry until it is up again.

The app can be down for almost 100 minutes: as long as it is temporarily up for 10 seconds within every 10-minute interval, the test will pass even though the app was down for 98 of those 100 minutes.
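If the goal really is to prove stability, a sketch that requires consecutive successes would be closer (it reuses the helper from this test, but the reset-on-failure policy is an assumption, not PR code):

```bash
# Require ten consecutive successful checks, resetting the counter on any
# failure, so a flapping deployment cannot sneak through.
local consecutive=0
while ((consecutive < 10)); do
    if assert_kube_deployment_available --namespace cattle-system rancher; then
        consecutive=$((consecutive + 1))
    else
        consecutive=0
    fi
    sleep 10
done
```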
Commits:

- Use long arguments for commands where available; tell shfmt that we're linting for bats (for `*.bash`).
- This test is prone to failing on macOS CI, possibly because the runners are somewhat slower. Try to improve this by _not_ waiting for the helm chart deployment to finish, but instead manually checking for key log lines in the containers until they are actually ready. This also does a couple of other things to help this test pass: pre-pull the rancher image to ensure the image pulling doesn't make the machine busier than necessary, and update the arguments for the cert-manager chart (removing deprecated usage).
- In case we got called right after a factory reset.
- By default it is set to 3, which is unneeded in CI.
- Since we'll be doing factory resets before each Kubernetes version (or in the next test file, if it's the last version), there's no need to do an uninstall that could take significant time.
- This test needs a lot of RAM (to run Rancher Manager); disable ramdisk to avoid hitting swap.
- This is a Kubernetes deployment name that is spawned as part of Rancher Manager.
Force-pushed from ff7056e to d59616f.
Comments (hopefully) addressed.

Windows is still failing for unrelated reasons:

```
kube-system   svclb-traefik-jpmk2   0/2   CrashLoopBackOff   10 (2m21s ago)   5m24s
```
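For anyone digging into that, typical first steps would be something like the following (illustrative triage commands, not part of this PR):

```bash
kubectl describe pod --namespace kube-system svclb-traefik-jpmk2
kubectl logs --namespace kube-system svclb-traefik-jpmk2 --all-containers --previous
```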
```bash
# Unfortunately, the Rancher pod could get restarted, so we need to put this in a loop :(
try wait_for_rancher_pod
```

> The pod might restart in the middle of one of the steps inside `wait_for_rancher_pod`; if that happens, the later checks are likely to fail before the timeout. I'll try to reword this.
```bash
# The rancher pod sometimes falls over on its own; retry in a loop
local i
for i in {1..10}; do
    sleep 1
    try --max 60 --delay 10 assert_kube_deployment_available --namespace cattle-system rancher
done
```

> This checks that the deployment has succeeded ten times (i.e. that it has stopped flapping).
Looks like we need more `|| return` clauses, because the `wait_for*` functions are invoked by `try`. I would prefer not to call them `wait_for*` if they are to be used like that, but we can rename them if/when we have a lint rule that enforces this.

I also still think the loop does not do what the comment says, and should either be modified or removed.
```bash
try assert_pod_log_line cattle-system rancher Listening on :443
try assert_pod_log_line cattle-system rancher Starting catalog controller
```
Since `wait_for_rancher_pod` is called via `try` (which calls it via `run`), we need to add `|| return` to each step that can fail. I also think the lines would be easier to read if the expected string were quoted:
```diff
-try assert_pod_log_line cattle-system rancher Listening on :443
-try assert_pod_log_line cattle-system rancher Starting catalog controller
+try assert_pod_log_line cattle-system rancher "Listening on :443" || return
+try assert_pod_log_line cattle-system rancher "Starting catalog controller" || return
```
But given that so many asserts all target the same namespace and app, I wonder if they shouldn't be selected by global variables as well:
```diff
-try assert_pod_log_line cattle-system rancher Listening on :443
-try assert_pod_log_line cattle-system rancher Starting catalog controller
+local NAMESPACE=cattle-system
+local APP=rancher
+try assert_pod_log_line "Listening on :443" || return
+try assert_pod_log_line "Starting catalog controller" || return
```
--set "extraArgs[0]=--enable-certificate-owner-ref=true" \ | ||
--create-namespace | ||
try assert_not_empty_list helm list --namespace cert-manager --deployed --output json --selector name=cert-manager |
Add an empty line before?
```
@@ -122,24 +232,32 @@ verify_rancher() {
    skip_unless_host_ip
fi

# Get k3s logs if possible before things fail
```
The comment doesn't match the following commands: they don't fetch logs, they just list deployments and pods, I assume to get the information into the BATS output.

This comment seems to belong on top of the next block instead. I guess it became confusing because I asked you to put in more newlines. 😄
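Presumably the block in question looks something like this (a reconstruction, not quoted from the diff):

```bash
# List cluster state so it lands in the BATS output before any assertion fails.
kubectl get deployments --all-namespaces
kubectl get pods --all-namespaces
```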
This is still (sometimes) failing, I think when the rancher pod restarts on a failure. This can lead to a CrashLoopBackOff that takes longer than two hours to resolve. We might need to disable this test on macOS completely instead :(

Sample run: https://github.com/mook-as/rd/actions/runs/9606696487/job/26527750610
Force-pushed from f60d7b5 to d59616f.
This changes the `k8s/helm-install-rancher` BATS test to not wait via `helm`, but instead examine the deployed pods for log lines (and otherwise wait for objects manually). This lets the test pass in CI, because macOS CI hardware appears to be slower.

Note that Windows CI is currently failing because WSL setup is broken; I'm looking into that separately.