Refactor: reliability, retries, reduced k8s apiserver load #14
@@ -69,6 +71,7 @@ function(jobName, agentEnv={}, stepEnvFile='', patchFunc=identity) patchFunc({
BUILDKITE_PLUGIN_K8S_RESOURCES_REQUEST_MEMORY: '',
BUILDKITE_PLUGIN_K8S_RESOURCES_LIMIT_MEMORY: '',
BUILDKITE_PLUGIN_K8S_WORKDIR: std.join('/', [env.BUILDKITE_BUILD_PATH, buildSubPath]),
BUILDKITE_PLUGIN_K8S_JOB_TTL_SECONDS_AFTER_FINISHED: '86400',
I'm really bad at Jsonnet. I've declared the default value in YAML, but I'm not sure how to keep this declaration in .jsonnet 😬
This looks correct! We have these as default values then we override them with any values provided by the pipeline https://github.com/EmbarkStudios/k8s-buildkite-plugin/pull/14/files#diff-d8106e0da5023a3b7eb0401c475f88b2R75
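For readers more at home in shell than in Jsonnet, the default-then-override behaviour described here can be sketched with parameter expansion. This is only an analogue of what the Jsonnet template does, not the plugin's code; the helper name `ttl_or_default` is hypothetical, and `86400` is the default shown in the diff.

```shell
set -eu

# Hypothetical helper: print the pipeline-provided TTL if the variable is set
# and non-empty, otherwise fall back to the default from the diff (86400).
ttl_or_default() {
  echo "${BUILDKITE_PLUGIN_K8S_JOB_TTL_SECONDS_AFTER_FINISHED:-86400}"
}

ttl_or_default
```

A pipeline that exports `BUILDKITE_PLUGIN_K8S_JOB_TTL_SECONDS_AFTER_FINISHED=600` would get `600` back; one that leaves it unset or empty would get the default.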
@@ -63,6 +63,18 @@ configuration:
    type: string
  use-agent-node-affinity:
    type: boolean
  print-resulting-job-spec:
    type: boolean
    default: false
Interestingly, I couldn't find any use of `default` in any Buildkite plugin (https://github.com/topics/buildkite-plugin), but the JSON Schema that Buildkite uses seems to allow it: https://json-schema.org/understanding-json-schema/reference/generic.html#annotations
That's interesting. Does Buildkite actually set a default value from the YAML on behalf of the plugin, or does it allow it solely for documentation purposes?
@iffyio I think I've resolved all the comments you've left so far :) Btw, FYI, I've also added Pod exit code propagation (updated the PR description), a very useful addition in my opinion 🙃
Thank you! I'll take another look at this soon.
sleep "$log_complete_retry_interval_sec"
done

status=""
Suggested change: replace `status=""` with `status=1`.
We see issues where we somehow end up exiting with this default value:
/buildkite/plugins/github-com-artem-zinnatullin-k8s-buildkite-plugin-git-65192202c691578f25d42d04d57fb4b12d5b49c6/hooks/command: line 149: exit: null: numeric argument required
so having this more conservative exit code is a bit nicer.
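One way to keep the final `exit` from failing on a non-numeric value, whatever ends up in `status`, is to validate it first. The sketch below is an assumption about how such a guard could look, not code from this PR; the function name `sanitize_exit_code` is hypothetical.

```shell
set -eu

# Hypothetical helper: print "$1" unchanged if it consists only of digits,
# otherwise print the conservative fallback 1.
sanitize_exit_code() {
  case "${1:-}" in
    '' | *[!0-9]*) echo 1 ;;   # empty, "null", or anything containing a non-digit
    *) echo "$1" ;;
  esac
}

status="null"                  # e.g. what jq prints when the path is missing
status="$(sanitize_exit_code "$status")"
echo "would exit with: $status"
```

With a guard like this, the initial value of `status` matters less, because anything that is not a plain number is turned into `1` before it reaches `exit`.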
The `null` is likely coming from setting the `status` to the output of `jq`, e.g. here: https://github.com/EmbarkStudios/k8s-buildkite-plugin/pull/14/files#diff-7a7c160f9b73cfeaf4d8b93aee769fc1R188
@artem-zinnatullin I think we'd want to only set `status` if `jq` finds a match? One way would be to enable the exit code with `jq -e`, so that it exits with non-zero if no match is found, and then only set the status if the exit code is 0.
`jq -e` sounds good. It's annoying that you have to do `set +e; jq -e; set -e;` or write to a file, like `if jq -e > result`, to handle both the error code and the output, but what can you do I guess 😅
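Under `set -e`, a command whose exit status is tested by `if` does not abort the script, so the output and the exit code of `jq -e` can be handled together without toggling `set +e` or writing to a file. A minimal sketch, assuming `jq` is installed; the function name `terminated_exit_code` and the sample JSON documents are made up for illustration, while the jq path is the one from the diff.

```shell
set -eu

# Hypothetical helper: print the first container's terminated exit code.
# Returns non-zero when the field is missing, because `jq -e` reports failure
# if the last output value is null or false.
terminated_exit_code() {
  printf '%s' "$1" | jq -e '.status.containerStatuses[0].state.terminated.exitCode'
}

# Made-up pod documents: one still running, one finished with exit code 3.
running_pod='{"status":{"containerStatuses":[{"state":{"running":{}}}]}}'
finished_pod='{"status":{"containerStatuses":[{"state":{"terminated":{"exitCode":3}}}]}}'

for pod_json in "$running_pod" "$finished_pod"; do
  # The assignment is the `if` condition, so a failing jq does not trip `set -e`.
  if status="$(terminated_exit_code "$pod_json")"; then
    echo "exit code: $status"
  else
    echo "not terminated yet"
  fi
done
```

An exit code of `0` still counts as a match here, since `jq -e` only treats `null` and `false` as failure.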
This looks great! Left a few comments, most of which should be minor fixes I think
If set to `true` plugin cleans up finished k8s job.
Default value: `true`.

If you have TTL controller or https://github.com/lwolf/kube-cleanup-operator running, it is highly recommended to set the value to `false` to reduce load on k8s api servers.
Similar suggestion to the `job-cleanup-via-plugin` option
while true
do
  set +e
  pod_json=$(kubectl get pod "$pod_name" -o json)
  init_container_status="$(echo "$pod_json" | jq ".status.initContainerStatuses[0].state.terminated.exitCode")"
  if [[ -n "$init_container_status" && "$init_container_status" != "0" ]]; then
    echo "Warning: init container failed with exit code $init_container_status, this usually indicates plugin misconfiguration or infrastructure failure"
    status="$init_container_status"
  else
    status="$(echo "$pod_json" | jq ".status.containerStatuses[0].state.terminated.exitCode")"
  fi
  set -e
  if [[ -n "$status" ]]; then
    break
  else
    sleep "$job_status_retry_interval_sec"
    if [[ $timeout -gt 0 ]]; then
      (( counter -= job_status_retry_interval_sec ))
    fi
  fi
done
Awesome work checking init containers too 🙌
status="0"
else
while true
do
Let's also print a message that says we're checking init containers status
status=0
status="0"
else
while true
This looks like we'll keep looping if we keep failing to get the init container status? Let's break out of the loop after a hardcoded number of attempts or timeout instead?
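A bounded version of such a loop could look like the sketch below. It is not the code that landed; `wait_for`, its parameters, and the demo `probe` function are hypothetical names, and the probe stands in for the `kubectl`/`jq` status check.

```shell
set -eu

# Hypothetical helper: run a probe command repeatedly until it succeeds or
# the timeout budget is used up.
#   $1 = timeout in seconds, $2 = retry interval in seconds (>= 1), rest = probe command
wait_for() {
  local timeout_sec="$1" interval_sec="$2" waited=0
  shift 2
  while ! "$@"; do
    if [ "$waited" -ge "$timeout_sec" ]; then
      echo "gave up after ${waited}s" >&2
      return 1
    fi
    sleep "$interval_sec"
    waited=$((waited + interval_sec))
  done
}

# Demo probe standing in for the real status check: succeeds on the third call.
attempts=0
probe() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]
}

if wait_for 5 1 probe; then
  echo "succeeded after $attempts attempts"
else
  echo "timed out"
fi
```

Counting elapsed sleep time rather than attempts keeps the interval and the timeout independently configurable, which matches the interval/timeout options the PR describes.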
@@ -0,0 +1 @@
.idea/
Let's not include a gitignore in the PR, since it's possible to have git exclude these files locally: https://stackoverflow.com/questions/1753070/how-do-i-configure-git-to-ignore-some-files-locally
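For reference, the local-only alternative uses `.git/info/exclude`, which Git reads like a `.gitignore` but which is never committed. A minimal sketch in a throwaway repository (the temporary directory and the `workspace.xml` file name are only for illustration):

```shell
set -eu

# Create a scratch repository so the example does not touch a real checkout.
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# Patterns in .git/info/exclude apply to this clone only and are not versioned.
mkdir -p .git/info
echo '.idea/' >> .git/info/exclude

mkdir .idea
touch .idea/workspace.xml

# check-ignore prints the path and exits 0 when the path is ignored.
git check-ignore .idea/workspace.xml
```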
Co-authored-by: Ifeanyi Ubah <[email protected]>
This landed in a8edf37. Thanks a lot @artem-zinnatullin! I'll look to create a release soon.
Glad to see that, sorry I left it unfinished, got personal stuff in the way
…On Wed, Apr 7, 2021 at 9:45 AM Ifeanyi Ubah ***@***.***> wrote:
Closed #14 <#14>
This PR combines all the changes I had to make to the plugin to make it stable in our CI environment:
- … `kubectl` invocations made by the plugin/jobs: `Unable to connect to the server: net/http: TLS handshake timeout` / `Unable to connect to the server: dial tcp x.x.x.x:443: i/o timeout`

It is an evolution of the original PR #8, which I closed with a promise to open a new one.

This PR:
- … `kubectl` invocation in a `while true` loop with interval and timeout being configurable; this allows surviving k8s apiserver downtimes or, say, network issues while also being fast in a normal scenario
- … `kubectl` invocations to reduce load on k8s apiserver
- … `spec.ttlSecondsAfterFinished` or lwolf/kube-cleanup-operator
- … `spec.ttlSecondsAfterFinished`
- … `BUILDKITE_BUILD_ID` to the Job environment

I know it sounds like a lot of changes, but without them the plugin didn't perform well in our environment. I tried to make the code as clean, readable and safe as my Bash knowledge allows.
Hope this brings more users to the plugin and a better experience for them!