# Refactor: reliability, retries, reduced k8s apiserver load (#14)
**`.gitignore`** (new file):

```diff
@@ -0,0 +1 @@
+.idea/
```
**`README.md`** — new configuration options, added after the existing `patch` example (`@@ -239,6 +239,92 @@`):

### `print-resulting-job-spec` (optional, boolean)

If set to `true`, the resulting k8s job spec is printed to the log. This can be useful when debugging.
### `job-ttl-seconds-after-finished` (optional, integer)

Configures [`spec.ttlSecondsAfterFinished`](https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/) on the k8s job. This requires the TTL Controller to be enabled in the cluster; otherwise the setting is ignored.
Default value: `86400`.
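Both options can be set like any other plugin option in the pipeline definition. A minimal sketch — the plugin reference `EmbarkStudios/k8s#v1.x`, the image, and the command below are illustrative placeholders, not values from this PR:

```yaml
steps:
  - label: "test"
    command: "make test"
    plugins:
      - EmbarkStudios/k8s#v1.x:
          image: "alpine:3.11"
          # Print the generated job spec for debugging.
          print-resulting-job-spec: true
          # Let the TTL Controller delete the job an hour after it finishes.
          job-ttl-seconds-after-finished: 3600
```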
### `jobs-cleanup-via-plugin` (optional, boolean)

If set to `true`, the plugin cleans up k8s jobs older than one day, even if they are still running.
Default value: `true`.

If you have the [TTL Controller](https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/) enabled, or some other means to clean up finished jobs, it is recommended to set this value to `false` in order to reduce load on the k8s API servers.
### `job-cleanup-after-finished-via-plugin` (optional, boolean)

If set to `true`, the plugin cleans up the finished k8s job.
Default value: `true`.

If you have the TTL Controller or https://github.com/lwolf/kube-cleanup-operator running, it is highly recommended to set this value to `false` to reduce load on the k8s API servers.
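If the TTL Controller (or kube-cleanup-operator) already garbage-collects jobs, both plugin-side cleanup paths can be switched off to cut apiserver traffic further. Again a hypothetical snippet, with the same placeholder plugin reference and step as above:

```yaml
steps:
  - label: "test"
    command: "make test"
    plugins:
      - EmbarkStudios/k8s#v1.x:
          image: "alpine:3.11"
          # Cleanup is delegated entirely to the cluster-side controller.
          jobs-cleanup-via-plugin: false
          job-cleanup-after-finished-via-plugin: false
```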
## Low Level Configuration via Environment Variables

Some of the plugin options can be configured via environment variables as follows ([also see the Buildkite docs](https://buildkite.com/docs/pipelines/environment-variables#defining-your-own)):

```yaml
env:
  BUILDKITE_PLUGIN_K8S_JOB_APPLY_RETRY_INTERVAL_SEC: "10"
```
### BUILDKITE_PLUGIN_K8S_JOB_APPLY_RETRY_INTERVAL_SEC

- Configures the interval between attempts to schedule the k8s job
- Default: `5`
- Unit type: integer seconds

### BUILDKITE_PLUGIN_K8S_JOB_APPLY_TIMEOUT_SEC

- Configures the total time limit across attempts to schedule the k8s job
- Default: `120`
- Unit type: integer seconds

### BUILDKITE_PLUGIN_K8S_JOB_STATUS_RETRY_INTERVAL_SEC

- Configures the interval between attempts to get the k8s job status
- Default: `5`
- Unit type: integer seconds

### BUILDKITE_PLUGIN_K8S_LOG_COMPLETE_RETRY_INTERVAL_SEC

- Configures the interval between attempts to verify that log streaming has ended
- Default: `1`
- Unit type: integer seconds

### BUILDKITE_PLUGIN_K8S_LOG_COMPLETE_TIMEOUT_SEC

- Configures the total time limit across attempts to verify that log streaming has ended
- Default: `30`
- Unit type: integer seconds

### BUILDKITE_PLUGIN_K8S_LOG_RETRY_INTERVAL_SEC

- Configures the interval between attempts to stream job logs
- Default: `3`
- Unit type: integer seconds

### BUILDKITE_PLUGIN_K8S_LOG_ATTEMPT_TIMEOUT_SEC

- Configures the time limit for a _single_ plugin attempt to stream job logs
- Default: `5`
- Unit type: integer seconds

> **Review comment:** I think we can skip this option and hardcode the default value in the script, since it's even lower level but also doesn't seem very useful to change? Also, it'll be one less option to configure ^^

### BUILDKITE_PLUGIN_K8S_JOB_CLEANUP_RETRY_INTERVAL_SEC

- Configures the interval between attempts to clean up finished jobs
- Default: `5`
- Unit type: integer seconds

### BUILDKITE_PLUGIN_K8S_JOB_CLEANUP_TIMEOUT_SEC

- Configures the total time limit across attempts to clean up finished jobs
- Default: `60`
- Unit type: integer seconds
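Putting several of these together, a pipeline-level `env` block can slow down polling to trade responsiveness for reduced apiserver load. All variable names below are documented above; the values are illustrative only, not recommendations from this PR:

```yaml
env:
  # Retry applying the job every 10s, for up to 5 minutes total.
  BUILDKITE_PLUGIN_K8S_JOB_APPLY_RETRY_INTERVAL_SEC: "10"
  BUILDKITE_PLUGIN_K8S_JOB_APPLY_TIMEOUT_SEC: "300"
  # Poll job status and logs less aggressively.
  BUILDKITE_PLUGIN_K8S_JOB_STATUS_RETRY_INTERVAL_SEC: "15"
  BUILDKITE_PLUGIN_K8S_LOG_RETRY_INTERVAL_SEC: "5"
  # Back off on cleanup attempts as well.
  BUILDKITE_PLUGIN_K8S_JOB_CLEANUP_RETRY_INTERVAL_SEC: "10"
```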
## Contributing

We welcome community contributions to this project.
**Command hook script** (`@@ -13,64 +13,195 @@ fi`), shown as the resulting file contents:

```bash
((timeout=BUILDKITE_TIMEOUT*60))
export BUILDKITE_TIMEOUT

# Default values can be overridden by setting "BUILDKITE_PLUGIN_K8S_*" env vars as used below.
readonly job_apply_retry_interval_sec="${BUILDKITE_PLUGIN_K8S_JOB_APPLY_RETRY_INTERVAL_SEC:-5}"
readonly job_apply_timeout_sec="${BUILDKITE_PLUGIN_K8S_JOB_APPLY_TIMEOUT_SEC:-120}"
readonly log_loop_retry_interval_sec="${BUILDKITE_PLUGIN_K8S_LOG_RETRY_INTERVAL_SEC:-3}"
readonly log_attempt_timeout_sec="${BUILDKITE_PLUGIN_K8S_LOG_ATTEMPT_TIMEOUT_SEC:-5}"
readonly job_status_retry_interval_sec="${BUILDKITE_PLUGIN_K8S_JOB_STATUS_RETRY_INTERVAL_SEC:-5}"
readonly log_complete_retry_interval_sec="${BUILDKITE_PLUGIN_K8S_LOG_COMPLETE_RETRY_INTERVAL_SEC:-1}"
readonly log_complete_timeout_sec="${BUILDKITE_PLUGIN_K8S_LOG_COMPLETE_TIMEOUT_SEC:-30}"
readonly use_agent_node_affinity=${BUILDKITE_PLUGIN_K8S_USE_AGENT_NODE_AFFINITY:-false}
readonly print_resulting_job_spec=${BUILDKITE_PLUGIN_K8S_PRINT_RESULTING_JOB_SPEC:-false}

readonly jobs_cleanup_via_plugin=${BUILDKITE_PLUGIN_K8S_JOBS_CLEANUP_VIA_PLUGIN:-}

# Marker files used by the background log tailers to signal that log streaming has completed.
readonly bootstrap_container_log_complete_marker_file="$(mktemp)"
readonly step_container_log_complete_marker_file="$(mktemp)"

function cleanup {
  rm -f "$bootstrap_container_log_complete_marker_file" "$step_container_log_complete_marker_file"

  if [[ "$jobs_cleanup_via_plugin" == "true" ]]; then
    # Delete all jobs older than a day.
    kubectl delete job "$(kubectl get job -l buildkite/plugin=k8s | awk 'match($4,/[0-9]+d/) {print $1}')" 2>/dev/null || true
  fi
}

trap cleanup EXIT

function tail_logs {
  local -r pod_name="$1"
  local -r container_name="$2"
  local -r log_completion_marker_file="$3"

  local log_snapshot=""
  # Once logs are not empty we start attempting to stream them.
  # Keep looping otherwise, since empty output means there is no useful log to
  # display yet, so we are not losing information by waiting.
  while true; do
    set +e
    log_snapshot="$(timeout "$log_attempt_timeout_sec" kubectl logs --limit-bytes "1024" "pod/$pod_name" --container "$container_name" 2>/dev/null)"
    set -e
    if [[ -n "$log_snapshot" ]]; then
      break
    fi
    sleep "$log_loop_retry_interval_sec"
  done

  # Run kubectl logs --follow in a loop since it can fail:
  # 1) The pod may not be initialized yet: "Error from server (BadRequest): container "step" in pod "somepod" is waiting to start: PodInitializing"
  # 2) It can fail mid-stream; in that case we will unfortunately display (parts of) the logs multiple times.
  # 3) It can hang without producing any output, which is why the loop above checks the contents and not only the exit code.
  while ! kubectl logs --follow "pod/$pod_name" --container "$container_name" 2>/dev/null; do
    sleep "$log_loop_retry_interval_sec"
  done

  echo "0" > "$log_completion_marker_file"
}
```

> **Review comment:** Can we include some mechanism to exit if we are not able to successfully verify logs, so that we don't loop forever otherwise?
echo "--- :kubernetes: Starting Kubernetes Job" | ||||||
|
||||||
export patchFunc=${BUILDKITE_PLUGIN_K8S_PATCH:-"function(f) f"} | ||||||
|
||||||
jobspec=$(jsonnet \ | ||||||
job_spec="$(jsonnet \ | ||||||
--tla-str "jobName=${job_name}" \ | ||||||
--tla-str-file "stepEnvFile=${BUILDKITE_ENV_FILE}" \ | ||||||
--tla-code "agentEnv=$(jq -c -n env)" \ | ||||||
--tla-code patchFunc \ | ||||||
"${basedir}/lib/job.jsonnet") | ||||||
"${basedir}/lib/job.jsonnet")" | ||||||
|
||||||
export BUILDKITE_PLUGIN_K8S_USE_AGENT_NODE_AFFINITY=${BUILDKITE_PLUGIN_K8S_USE_AGENT_NODE_AFFINITY:-'false'} | ||||||
if [[ "$BUILDKITE_PLUGIN_K8S_USE_AGENT_NODE_AFFINITY" == 'true' ]]; then | ||||||
if [[ "$use_agent_node_affinity" == "true" ]]; then | ||||||
for field in affinity tolerations; do | ||||||
buildkite_agent_value=$(kubectl get pod "$(cat /etc/hostname)" -o json | jq ".spec.$field") | ||||||
jobspec=$(echo "$jobspec" | jq ".spec.template.spec.$field=$buildkite_agent_value") | ||||||
buildkite_agent_value="$(kubectl get pod "$(cat /etc/hostname)" -o json | jq ".spec.$field")" | ||||||
job_spec="$(echo "$job_spec" | jq ".spec.template.spec.$field=$buildkite_agent_value")" | ||||||
done | ||||||
fi | ||||||
|
||||||
echo "$jobspec" | kubectl apply -f - | ||||||
if [[ "$print_resulting_job_spec" == "true" ]]; then | ||||||
echo -e "Resulting k8s job spec:\n$job_spec" | ||||||
fi | ||||||
|
||||||
readonly job_apply_start_time="$SECONDS" | ||||||
job_apply_exit_code="" | ||||||
|
||||||
while [[ "$((SECONDS - job_apply_start_time))" -lt "$job_apply_timeout_sec" ]] | ||||||
do | ||||||
set +e | ||||||
echo "$job_spec" | kubectl apply -f - | ||||||
job_apply_exit_code="$?" | ||||||
set -e | ||||||
|
||||||
if [[ "$job_apply_exit_code" == "0" ]]; then | ||||||
break | ||||||
else | ||||||
echo "Attempt to apply the job failed, exit code '$job_apply_exit_code'" | ||||||
sleep "$job_apply_retry_interval_sec" | ||||||
fi | ||||||
done | ||||||
|
||||||
if [[ "$job_apply_exit_code" != "0" ]]; then | ||||||
echo "Failed to apply job, exit code '$job_apply_exit_code'" | ||||||
exit "$job_apply_exit_code" | ||||||
fi | ||||||
|
||||||
echo "Timeout: ${timeout}s" | ||||||
|
||||||
echo "+++ :kubernetes: Running image: ${BUILDKITE_PLUGIN_K8S_IMAGE}" | ||||||
echo "--- :kubernetes: Running image: ${BUILDKITE_PLUGIN_K8S_IMAGE}" | ||||||
|
||||||
counter="$timeout" | ||||||
while true | ||||||
do | ||||||
set +e | ||||||
pod_name=$(kubectl get pod -l "job-name=$job_name" --output=jsonpath="{.items[*].metadata.name}") | ||||||
set -e | ||||||
|
||||||
if [[ -n "$pod_name" ]]; then | ||||||
break | ||||||
else | ||||||
sleep "$job_status_retry_interval_sec" | ||||||
if [[ $timeout -gt 0 ]]; then | ||||||
(( counter -= job_status_retry_interval_sec )) | ||||||
fi | ||||||
fi | ||||||
done | ||||||
|
||||||
echo "Pod is running: $pod_name" | ||||||
|
||||||
echo "--- :kubernetes: bootstrap container" | ||||||
tail_logs "$pod_name" "bootstrap" "$bootstrap_container_log_complete_marker_file" | ||||||
|
||||||
tail_logs & | ||||||
echo "+++ :kubernetes: step container" | ||||||
tail_logs "$pod_name" "step" "$step_container_log_complete_marker_file" & | ||||||
|
||||||
sleeper=2 | ||||||
counter=${timeout} | ||||||
counter="$timeout" | ||||||
jobstatus="" | ||||||
while [[ -z "$jobstatus" ]] ; do | ||||||
set +e | ||||||
jobstatus=$(kubectl get job "${job_name}" -o 'jsonpath={.status.conditions[].type}') | ||||||
sleep $sleeper | ||||||
set -e | ||||||
|
||||||
if [[ -n "$jobstatus" ]]; then | ||||||
break | ||||||
else | ||||||
sleep "$job_status_retry_interval_sec" | ||||||
fi | ||||||
|
||||||
if [[ $timeout -gt 0 ]]; then | ||||||
(( counter -= sleeper )) || jobstatus="timeout" | ||||||
(( counter -= job_status_retry_interval_sec )) || jobstatus="timed-out" | ||||||
fi | ||||||
done | ||||||
|
||||||
echo | ||||||
echo "--- :kubernetes: Job status: $jobstatus" | ||||||
status=1 | ||||||
|
||||||
# Wait for logs to be fully printed, printing runs in a separate process and we're racing with it. | ||||||
readonly log_complete_start_time="$SECONDS" | ||||||
while [[ "$(cat "$step_container_log_complete_marker_file")" != "0" ]] && [[ "$((SECONDS - log_complete_start_time))" -lt "$log_complete_timeout_sec" ]] | ||||||
do | ||||||
sleep "$log_complete_retry_interval_sec" | ||||||
done | ||||||
|
||||||
status="" | ||||||
> **Review comment** (on the `status=""` default): We see issues where somehow we end up exiting with this default value, so having this more conservative exit code is a bit nicer.
>
> **Review comment:** @artem-zinnatullin I think we'd want to only set status if …
if [[ "$jobstatus" == "Complete" ]] ; then | ||||||
echo "success" | ||||||
status=0 | ||||||
status="0" | ||||||
else | ||||||
while true | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This looks like we'll keep looping if we keep failing to get the init container status? Let's break out of the loop after a hardcoded number of attempts or timeout instead? |
||||||
do | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's also print a message that says we're checking init containers status |
||||||
set +e | ||||||
pod_json=$(kubectl get pod "$pod_name" -o json) | ||||||
init_container_status="$(echo "$pod_json" | jq ".status.initContainerStatuses[0].state.terminated.exitCode")" | ||||||
if [[ -n "$init_container_status" && "$init_container_status" != "0" ]]; then | ||||||
echo "Warning: init container failed with exit code $init_container_status, this usually indicates plugin misconfiguration or infrastructure failure" | ||||||
status="$init_container_status" | ||||||
else | ||||||
status="$(echo "$pod_json" | jq ".status.containerStatuses[0].state.terminated.exitCode")" | ||||||
fi | ||||||
set -e | ||||||
if [[ -n "$status" ]]; then | ||||||
break | ||||||
else | ||||||
sleep "$job_status_retry_interval_sec" | ||||||
if [[ $timeout -gt 0 ]]; then | ||||||
(( counter -= job_status_retry_interval_sec )) | ||||||
fi | ||||||
fi | ||||||
done | ||||||
Comment on lines
+179
to
+199
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Awesome work checking init containers too 🙌 |
||||||
fi | ||||||
|
||||||
if [[ -z "$status" ]]; then | ||||||
echo "Warning: could not get actual exit code of the job" | ||||||
status="42" | ||||||
fi | ||||||
|
||||||
exit $status | ||||||
exit "$status" |
**`lib/job.jsonnet`:**

```diff
@@ -8,6 +8,7 @@ local allowedEnvs = std.set(
   'BUILDKITE_MESSAGE',
   'BUILDKITE_BUILD_CREATOR',
   'BUILDKITE_BUILD_CREATOR_EMAIL',
+  'BUILDKITE_BUILD_ID',
   'BUILDKITE_BUILD_NUMBER',
   'BUILDKITE_BUILD_PATH',
   'BUILDKITE_BUILD_URL',
```
```diff
@@ -35,10 +36,11 @@ local numberSuffix(s) =

 local labelChars = std.set(std.stringChars('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_.'));
 local labelValue(s) =
-  std.join('', [
+  local sanitizedValue = std.join('', [
     if std.setMember(c, labelChars) then c else '_'
     for c in std.stringChars(s)
   ]);
+  if std.length(sanitizedValue) < 63 then sanitizedValue else std.substr(sanitizedValue, 0, 63);

 function(jobName, agentEnv={}, stepEnvFile='', patchFunc=identity) patchFunc({
   local buildSubPath = std.join('/', [
```

The sanitized value is now truncated to 63 characters, the maximum length Kubernetes allows for a label value.
```diff
@@ -69,6 +71,7 @@ function(jobName, agentEnv={}, stepEnvFile='', patchFunc=identity) patchFunc({
   BUILDKITE_PLUGIN_K8S_RESOURCES_REQUEST_MEMORY: '',
   BUILDKITE_PLUGIN_K8S_RESOURCES_LIMIT_MEMORY: '',
   BUILDKITE_PLUGIN_K8S_WORKDIR: std.join('/', [env.BUILDKITE_BUILD_PATH, buildSubPath]),
+  BUILDKITE_PLUGIN_K8S_JOB_TTL_SECONDS_AFTER_FINISHED: '86400',
 } + agentEnv,

 local stepEnv =
```

> **Review comment (@artem-zinnatullin):** I'm really bad at Jsonnet; I've declared the default value in YAML, but I'm not sure how to keep this declaration in …
>
> **Reply:** This looks correct! We have these as default values, then we override them with any values provided by the pipeline: https://github.com/EmbarkStudios/k8s-buildkite-plugin/pull/14/files#diff-d8106e0da5023a3b7eb0401c475f88b2R75
```diff
@@ -141,7 +144,6 @@ function(jobName, agentEnv={}, stepEnvFile='', patchFunc=identity) patchFunc({
   'build/creator-email': env.BUILDKITE_BUILD_CREATOR_EMAIL,
   'build/id': env.BUILDKITE_BUILD_ID,
   'build/url': env.BUILDKITE_BUILD_URL,
-  'build/message': env.BUILDKITE_MESSAGE,
   'build/number': env.BUILDKITE_BUILD_NUMBER,
   'build/organization': env.BUILDKITE_ORGANIZATION_SLUG,
   'build/repo': env.BUILDKITE_REPO,
```
```diff
@@ -292,6 +294,7 @@ function(jobName, agentEnv={}, stepEnvFile='', patchFunc=identity) patchFunc({
   backoffLimit: 0,
   activeDeadlineSeconds: deadline,
   completions: 1,
+  ttlSecondsAfterFinished: std.parseInt(env.BUILDKITE_PLUGIN_K8S_JOB_TTL_SECONDS_AFTER_FINISHED),
   template: {
     metadata: {
       labels: labels,
```
> **Review comment** (on the new `.gitignore`): Let's not include a gitignore in the PR, since it's possible to have git exclude these files locally: https://stackoverflow.com/questions/1753070/how-do-i-configure-git-to-ignore-some-files-locally