Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor: reliability, retries, reduced k8s apiserver load #14

Closed

Conversation

artem-zinnatullin
Copy link
Contributor

@artem-zinnatullin artem-zinnatullin commented Jul 23, 2020

This PR combines all the changes I had to make to the plugin to make it stable in our CI environment:

  • AWS EKS cluster
  • k8s apiserver was failing under the load created by frequent kubectl invocations made by the plugin/jobs: Unable to connect to the server: net/http: TLS handshake timeout / Unable to connect to the server: dial tcp x.x.x.x:443: i/o timeout
  • ~10k CI jobs per day, peaking at 3-4k per hour

It is an evolution of the original PR #8 which I closed and promised to open a new one.

This PR:

  • Wraps every kubectl invocation in a while true loop with interval and timeout being configurable, this allows surviving k8s apiserver downtimes or say network issues while also being fast in a normal scenario
  • Unifies configuration values while allowing them to be overridden via environment variables
  • Changes log streaming to first wait for some logs to appear and only then stream the log, previously we often were getting logs printed either multiple times or never
  • Reduces number of kubectl invocations to reduce load on k8s apiserver
  • Adds option to skip job cleanup logic if it was successful for users relying on spec.ttlSecondsAfterFinished or lwolf/kube-cleanup-operator
  • Adds option to set spec.ttlSecondsAfterFinished
  • Adds while true loop with configurable interval and timeout to wait for completion of log streaming racing with job status check
  • Fixes Jobs fail if branch names are longer than 63 characters #5 when a job annotation like branch name is longer than 63 symbols (limitation in k8s)
  • Removes commit message from job annotation, it causes unrecoverable job metadata overflow if it's long
  • Propagates BUILDKITE_BUILD_ID to the Job environment
  • Propagates actual Pod exit code to Buildkite if Job fails, very useful for Buildkite retries based on exit code or other exit code analysis

I know it sounds like a lot of changes but without them the plugin didn't perform well in our environment. I tried to make the code as clean, readable and safe as my Bash knowledge allows.

Hope this brings more users to the plugin and a better experience for them!

@coderanger
Copy link

coderanger commented Sep 8, 2020

set -x will only echo the running commands, it's not very verbose. And stdout/err desync is fairly common depending on your underlying container logging driver :) Docker's default one collects the two independently and in pipes so they are basically batched. The env var change is required and correct, the input format looks like KEY="VALUE" as confirmed by the Buildkite docs. It might even need more work depending on the formatting used for values that actually contain ", I haven't tested that yet.

@@ -69,6 +71,7 @@ function(jobName, agentEnv={}, stepEnvFile='', patchFunc=identity) patchFunc({
BUILDKITE_PLUGIN_K8S_RESOURCES_REQUEST_MEMORY: '',
BUILDKITE_PLUGIN_K8S_RESOURCES_LIMIT_MEMORY: '',
BUILDKITE_PLUGIN_K8S_WORKDIR: std.join('/', [env.BUILDKITE_BUILD_PATH, buildSubPath]),
BUILDKITE_PLUGIN_K8S_JOB_TTL_SECONDS_AFTER_FINISHED: '86400',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really bad at Jsonnet, I've declared default value in YAML, but not sure how to keep this declaration in .jsonnet 😬

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks correct! We have these as default values then we override them with any values provided by the pipeline https://github.com/EmbarkStudios/k8s-buildkite-plugin/pull/14/files#diff-d8106e0da5023a3b7eb0401c475f88b2R75

@@ -63,6 +63,18 @@ configuration:
type: string
use-agent-node-affinity:
type: boolean
print-resulting-job-spec:
type: boolean
default: false
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, I couldn't find use of default in any Buildkite plugin https://github.com/topics/buildkite-plugin, but JSON Schema that Buildkite uses seems to allow it https://json-schema.org/understanding-json-schema/reference/generic.html#annotations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting, does buildkite actually set a default value from the yaml on behalf of the plugin or does it allow it solely for documentation purposes?

@artem-zinnatullin
Copy link
Contributor Author

@iffyio I think I've resolved all comments you've left up until now :)

Btw, fyi I've also added Pod exit code propagation (updated PR description), very useful addition in my opinion 🙃

@iffyio
Copy link
Contributor

iffyio commented Oct 8, 2020

@iffyio I think I've resolved all comments you've left up until now :)

Btw, fyi I've also added Pod exit code propagation (updated PR description), very useful addition in my opinion

Thank you! I'll take another look at this soon.

sleep "$log_complete_retry_interval_sec"
done

status=""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
status=""
status=1

we see issues where somehow we end up exiting with this default value:

/buildkite/plugins/github-com-artem-zinnatullin-k8s-buildkite-plugin-git-65192202c691578f25d42d04d57fb4b12d5b49c6/hooks/command: line 149: exit: null: numeric argument required

so having this more conservative exit code is a bit nicer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The null is likely coming from setting the status to the output of jq e.g here https://github.com/EmbarkStudios/k8s-buildkite-plugin/pull/14/files#diff-7a7c160f9b73cfeaf4d8b93aee769fc1R188

@artem-zinnatullin I think we'd want to only set status if jq finds a match? one way would be to use enable exit code with jq -e so that it exits with non-zero if no match is found and then, only set the status if the exit code is 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jq -e sounds good, annoying that you have to do set +e; jq -e; set -e; or write to a file to like if jq -e > result to handle both error code and output, but what can you do I guess 😅

Copy link
Contributor

@iffyio iffyio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Left a few comments, most of which should be minor fixes I think

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
If set to `true` plugin cleans up finished k8s job.
Default value: `true`.

If you have TTL controller or https://github.com/lwolf/kube-cleanup-operator running, it is highly recommended to set the value to `false` to reduce load on k8s api servers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar suggestion to the job-cleanup-via-plugin option

Comment on lines +179 to +199
while true
do
set +e
pod_json=$(kubectl get pod "$pod_name" -o json)
init_container_status="$(echo "$pod_json" | jq ".status.initContainerStatuses[0].state.terminated.exitCode")"
if [[ -n "$init_container_status" && "$init_container_status" != "0" ]]; then
echo "Warning: init container failed with exit code $init_container_status, this usually indicates plugin misconfiguration or infrastructure failure"
status="$init_container_status"
else
status="$(echo "$pod_json" | jq ".status.containerStatuses[0].state.terminated.exitCode")"
fi
set -e
if [[ -n "$status" ]]; then
break
else
sleep "$job_status_retry_interval_sec"
if [[ $timeout -gt 0 ]]; then
(( counter -= job_status_retry_interval_sec ))
fi
fi
done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work checking init containers too 🙌

hooks/command Outdated Show resolved Hide resolved
status="0"
else
while true
do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also print a message that says we're checking init containers status

status=0
status="0"
else
while true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like we'll keep looping if we keep failing to get the init container status? Let's break out of the loop after a hardcoded number of attempts or timeout instead?

@@ -0,0 +1 @@
.idea/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not include a gitignore in the PR since its possible to have git exclude these files locally https://stackoverflow.com/questions/1753070/how-do-i-configure-git-to-ignore-some-files-locally

@iffyio
Copy link
Contributor

iffyio commented Apr 7, 2021

This landed in a8edf37 Thanks a lot @artem-zinnatullin I'll look to create a release soon

@iffyio iffyio closed this Apr 7, 2021
@artem-zinnatullin
Copy link
Contributor Author

artem-zinnatullin commented Apr 7, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Jobs fail if branch names are longer than 63 characters
4 participants