CI: BATS: Make k8s/helm-install-rancher pass on macOS #7069
base: main
Conversation
Force-pushed from 4bee8ad to 7a767a1.
Please see individual comments
```bash
assert_true() {
    run --separate-stderr "$@"
    assert_success || return
    is_true "$output" || return
```
Is this maybe a bit too forgiving? It will treat anything that is not one of `''`, `'0'`, `'no'`, `'false'` as true. So if for some reason the status field could say `offline`, it would be treated as true.
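A stricter variant would accept only an explicit allow-list of truthy values, so that an unexpected status such as `offline` fails loudly. A minimal sketch (the name `is_true_strict` and the accepted values are assumptions, not code from this PR):

```bash
# Hypothetical stricter check: only an explicit allow-list counts as true,
# so unexpected values like "offline" fail instead of passing silently.
is_true_strict() {
    case "${1,,}" in # ${1,,} lowercases the argument (requires bash 4+)
    1 | true | yes | on) return 0 ;;
    *) return 1 ;;
    esac
}
```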
```bash
if using_docker; then
    try docker pull --quiet "rancher/rancher:v$rancher_chart_version"
else
    try nerdctl pull --namespace k8s.io --quiet "rancher/rancher:v$rancher_chart_version"
fi
```
Suggested change:

```diff
-if using_docker; then
-    try docker pull --quiet "rancher/rancher:v$rancher_chart_version"
-else
-    try nerdctl pull --namespace k8s.io --quiet "rancher/rancher:v$rancher_chart_version"
-fi
+local CONTAINERD_NAMESPACE=k8s.io
+try ctrctl pull --quiet "rancher/rancher:v$rancher_chart_version"
```
Also, technically `--namespace` should come before the `pull` subcommand because it is a global option, but nowadays `nerdctl` deals with it correctly even when it is specified later.
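For context, a `ctrctl`-style wrapper presumably dispatches to whichever engine is active, roughly along these lines (an illustrative sketch; the real helper may differ):

```bash
# Assumed dispatcher: forward to docker or nerdctl depending on the engine,
# honoring CONTAINERD_NAMESPACE in the nerdctl case.
ctrctl() {
    if using_docker; then
        docker "$@"
    else
        nerdctl --namespace "${CONTAINERD_NAMESPACE:-default}" "$@"
    fi
}
```

Because bash locals are dynamically scoped, the suggested `local CONTAINERD_NAMESPACE=k8s.io` would be visible inside `ctrctl` for the duration of the calling function.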
```bash
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher Listening on :443
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher Starting catalog controller
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher Watching metadata for rke-machine-config.cattle.io/v1
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher 'Creating clusterRole for roleTemplate Cluster Owner (cluster-owner).'
try --max 60 --delay 10 assert_pod_log_line cattle-system rancher Rancher startup complete
try --max 120 --delay 10 assert_pod_log_line cattle-system rancher Created machine for node
```
I was a little shocked that the total timeouts add up to 70 minutes (five steps at 60 retries × 10 s = 10 minutes each, plus 120 × 10 s = 20 minutes for the last one), but I guess there must be progress at least every 10 minutes. Can the final step really take another 20 minutes???

Anyways, do we need a `|| return` on each `try` statement, for consistency? I guess not, if we follow the rule that `wait_for_*` functions are never called via `try`. Maybe we need a linter rule for that (not in this PR).
Yeah, in practice we never need that much. I'll tone down the max retries based on what actually happened in the runs.
```bash
local host
host=$(traefik_hostname) || return

comment "Installing rancher $rancher_chart_version"
# The helm install can take a long time, especially on CI
```
I think this comment should spell out that we intentionally don't use the `--wait` and `--timeout` options for `helm`, but instead manually check progress along the way, because things are so slow in CI.
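Something along these lines, perhaps (the chart arguments shown here are illustrative, not taken from this PR):

```bash
# Deliberately NOT passing --wait/--timeout: CI machines are slow enough
# that helm's own timeout can trip. Instead, poll pod logs and deployment
# status afterwards until Rancher is actually ready.
helm upgrade --install rancher rancher-latest/rancher \
    --namespace cattle-system \
    --set hostname="$host" \
    --set replicas=1
```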
```bash
# Unfortunately, the Rancher pod could get restarted, so we need to put this in a loop :(
try wait_for_rancher_pod
```
I don't understand the comment. The `try` will succeed when the pod is running. So how would it deal with the pod restarting after `try` has succeeded once?
The pod might restart in the middle of one of the steps inside `wait_for_rancher_pod`; if that happens, the later checks are likely to fail before the timeout. I'll try to reword this.
```bash
# The rancher pod sometimes falls over on its own; retry in a loop
local i
for i in {1..10}; do
    sleep 1
    try --max 60 --delay 10 assert_kube_deployment_available --namespace cattle-system rancher
done
```
I don't get the purpose of the loop. It checks that the deployment is available at least 10 times, but it can fall over up to 59 times between each check. So what does this actually prove?
This checks that the deployment has succeeded ten times (i.e. that it has stopped flapping).
That was my point: how does this show that it has stopped flapping? It just means you have observed it running 10 times, but that doesn't mean it has stopped flapping, because you continue to retry until it is up again.

The app can be down for almost 100 minutes: as long as it is temporarily up for 10 seconds within every 10-minute interval, the test will pass even though the app was down for 98 of those 100 minutes.
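If the goal really is to prove stability, a sketch that requires consecutive successes would be closer (it reuses the helper from this test, but the reset-on-failure policy is an assumption, not PR code):

```bash
# Require ten consecutive successful checks, resetting the counter on any
# failure, so a flapping deployment cannot sneak through.
local consecutive=0
while ((consecutive < 10)); do
    if assert_kube_deployment_available --namespace cattle-system rancher; then
        consecutive=$((consecutive + 1))
    else
        consecutive=0
    fi
    sleep 10
done
```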
Commits:

- Use long arguments for commands where available; tell shfmt that we're linting for bats (for `*.bash`).
- This test is prone to failing on macOS CI, possibly because the runners are somewhat slower. Try to improve this by _not_ waiting for the helm chart deployment to finish, but instead manually checking for key log lines in the containers until they are actually ready. This also does a couple of other things to help this test pass: pre-pull the rancher image to ensure the image pulling doesn't make the machine busier than necessary, and update the arguments for the cert-manager chart (removing deprecated usage).
- In case we got called right after a factory reset.
- By default it is set to 3, which is unneeded in CI.
- Since we'll be doing factory resets before each Kubernetes version (or in the next test file, if it's the last version), there's no need to do an uninstall that could take significant time.
- This test needs a lot of RAM (to run Rancher Manager); disable ramdisk to avoid hitting swap.
- This is a Kubernetes deployment name that is spawned as part of Rancher Manager.
Force-pushed from ff7056e to d59616f.
Comments (hopefully) addressed.

Windows is still failing for unrelated reasons:

```
kube-system   svclb-traefik-jpmk2   0/2   CrashLoopBackOff   10 (2m21s ago)   5m24s
```
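For anyone digging into that, typical first steps would be something like the following (illustrative triage commands, not part of this PR):

```bash
kubectl describe pod --namespace kube-system svclb-traefik-jpmk2
kubectl logs --namespace kube-system svclb-traefik-jpmk2 --all-containers --previous
```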
```bash
# Unfortunately, the Rancher pod could get restarted, so we need to put this in a loop :(
try wait_for_rancher_pod
```

> The pod might restart in the middle of one of the steps inside `wait_for_rancher_pod`; if that happens, the later checks are likely to fail before the timeout. I'll try to reword this.
```bash
# The rancher pod sometimes falls over on its own; retry in a loop
local i
for i in {1..10}; do
    sleep 1
    try --max 60 --delay 10 assert_kube_deployment_available --namespace cattle-system rancher
done
```

> This checks that the deployment has succeeded ten times (i.e. that it has stopped flapping).
Looks like we need more `|| return` clauses, because the `wait_for*` functions are invoked by `try`. I would prefer not to call them `wait_for*` if they are to be used like that, but we can rename them if/when we have a lint rule that enforces this.

I also still think the loop does not do what the comment says, and should either be modified or removed.
```bash
try assert_pod_log_line cattle-system rancher Listening on :443
try assert_pod_log_line cattle-system rancher Starting catalog controller
```
Since `wait_for_rancher_pod` is called via `try` (which calls it via `run`), we need to add `|| return` to each step that can fail. I also think the lines would be easier to read if the expected string were quoted:
```diff
-try assert_pod_log_line cattle-system rancher Listening on :443
-try assert_pod_log_line cattle-system rancher Starting catalog controller
+try assert_pod_log_line cattle-system rancher "Listening on :443" || return
+try assert_pod_log_line cattle-system rancher "Starting catalog controller" || return
```
But given that so many asserts all target the same namespace and app, I wonder if they shouldn't be selected by global variables as well:
```diff
-try assert_pod_log_line cattle-system rancher Listening on :443
-try assert_pod_log_line cattle-system rancher Starting catalog controller
+local NAMESPACE=cattle-system
+local APP=rancher
+try assert_pod_log_line "Listening on :443" || return
+try assert_pod_log_line "Starting catalog controller" || return
```
--set "extraArgs[0]=--enable-certificate-owner-ref=true" \ | ||
--create-namespace | ||
try assert_not_empty_list helm list --namespace cert-manager --deployed --output json --selector name=cert-manager |
Add an empty line before?
```
@@ -122,24 +232,32 @@ verify_rancher() {
    skip_unless_host_ip
fi

# Get k3s logs if possible before things fail
```
The comment doesn't match the following commands: they don't fetch logs, they just list deployments and pods, I assume to get the information into the BATS output.

This comment seems to belong on top of the next block instead. I guess it became confusing because I asked you to put in more newlines. 😄
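Presumably the block in question looks something like this (a reconstruction, not quoted from the diff):

```bash
# List cluster state so it lands in the BATS output before any assertion fails.
kubectl get deployments --all-namespaces
kubectl get pods --all-namespaces
```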
This is still (sometimes) failing, I think when the rancher pod restarts on a failure. This can lead to a CrashLoopBackOff that takes longer than two hours to resolve. We might need to disable this test on macOS completely instead :(

Sample run: https://github.com/mook-as/rd/actions/runs/9606696487/job/26527750610
Force-pushed from f60d7b5 to d59616f.
This changes the `k8s/helm-install-rancher` BATS test to not wait via `helm`, but instead examine the deployed pods for log lines (and otherwise wait for objects manually). This lets the test pass in CI, because macOS CI hardware appears to be slower.

Note that Windows CI is currently failing because WSL setup is broken; I'm looking into that separately.