Merge pull request #1193 from stanislav-zaprudskiy/add_termination_grace_period_seconds

AWX: Add `termination_grace_period_seconds`
TheRealHaoLiu authored Feb 28, 2023
2 parents b5f255c + 49d1f00 commit 46da413
Showing 13 changed files with 376 additions and 58 deletions.
41 changes: 41 additions & 0 deletions README.md
@@ -49,6 +49,7 @@ An [Ansible AWX](https://github.com/ansible/awx) operator for Kubernetes built w
* [Upgrade of instances without auto upgrade](#upgrade-of-instances-without-auto-upgrade)
* [Service Account](#service-account)
* [Labeling operator managed objects](#labeling-operator-managed-objects)
* [Pods termination grace period](#pods-termination-grace-period)
* [Uninstall](#uninstall)
* [Upgrading](#upgrading)
* [Backup](#backup)
@@ -1246,6 +1247,46 @@ spec:
...
```

#### Pods termination grace period

During deployment restarts or new rollouts, when old ReplicaSet Pods are being
terminated, jobs that are managed (executed or controlled) by the old AWX Pods
may end up in the `Error` state, because there is no mechanism to transfer them
to the newly spawned AWX Pods. To work around the problem, set
`termination_grace_period_seconds` in the AWX spec, which does the following:

* It sets the corresponding
[`terminationGracePeriodSeconds`](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination)
field in the Pod spec of the AWX Deployment to the value provided.

> The grace period is the duration in seconds after the processes running in
> the pod are sent a termination signal and the time when the processes are
> forcibly halted with a kill signal

* It adds a
[`PreStop`](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#hook-handler-execution)
hook script, which keeps AWX Pods in the terminating state until it finishes,
up to `terminationGracePeriodSeconds`.

> This grace period applies to the total time it takes for both the PreStop
> hook to execute and for the Container to stop normally

The hook script simply waits until the corresponding AWX Pod (instance) no
longer has any managed jobs, at which point it finishes with success and hands
the rest of the Pod termination process over to the normal AWX processes.

Set this value to the maximum duration you are willing to wait for the
affected jobs to finish. Keep in mind that waiting for such jobs may increase
Pod termination time in situations such as `kubectl rollout restart`, an AWX
upgrade by the operator, or Kubernetes [API-initiated
evictions](https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/).


| Name | Description | Default |
| -------------------------------- | --------------------------------------------------------------- | ------- |
| termination_grace_period_seconds | Optional duration in seconds pods need to terminate gracefully  | not set |
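
As a rough illustration of the setting described above, the following sketch writes a minimal AWX manifest to a temporary file. The instance name `awx-demo`, the file path, and the one-hour value are assumptions chosen for the example, not recommendations from the operator.

```shell
#!/usr/bin/env bash
# Hypothetical manifest: the name, path and 3600 s value are illustrative only.
manifest="${TMPDIR:-/tmp}/awx-demo.yaml"

cat > "$manifest" <<'EOF'
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx-demo
spec:
  # Allow up to one hour for managed jobs to finish before Pods are killed.
  termination_grace_period_seconds: 3600
EOF

# Apply it against your own cluster (not run here):
#   kubectl apply -f "$manifest"
echo "wrote $manifest"
```

A larger value protects long-running jobs at the cost of slower rollouts, so it is worth sizing it against the longest job you expect to keep alive.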

### Uninstall ###

To uninstall an AWX deployment instance, you basically need to remove the AWX kind related to that instance. For example, to delete an AWX instance named awx-demo, you would do:
4 changes: 4 additions & 0 deletions config/crd/bases/awx.ansible.com_awxs.yaml
@@ -525,6 +525,10 @@ spec:
type: array
type: object
type: object
termination_grace_period_seconds:
description: Optional duration in seconds pods need to terminate gracefully
type: integer
format: int32
service_labels:
description: Additional labels to apply to the service
type: string
@@ -622,6 +622,11 @@ spec:
x-descriptors:
- urn:alm:descriptor:com.tectonic.ui:advanced
- urn:alm:descriptor:com.tectonic.ui:hidden
- displayName: Termination Grace Period Seconds
path: termination_grace_period_seconds
x-descriptors:
- urn:alm:descriptor:com.tectonic.ui:advanced
- urn:alm:descriptor:com.tectonic.ui:hidden
- displayName: Service Labels
path: service_labels
x-descriptors:
18 changes: 9 additions & 9 deletions roles/backup/tasks/creation.yml
@@ -32,22 +32,22 @@
- this_backup['resources'][0]['metadata']['labels']

- block:
    - include_tasks: init.yml

    - include_tasks: postgres.yml

    - include_tasks: awx-cro.yml

    - include_tasks: secrets.yml

    - name: Set flag signifying this backup was successful
      set_fact:
        backup_complete: true

    - include_tasks: cleanup.yml

  when:
    - this_backup['resources'][0]['status']['backupDirectory'] is not defined

- name: Update status variables
  include_tasks: update_status.yml
66 changes: 66 additions & 0 deletions roles/installer/files/pre-stop/termination-env
@@ -0,0 +1,66 @@
# file, which when exists, indicates that `master` script has successfully
# completed pre-stop script execution
marker_file="${PRE_STOP_MARKER_FILE:-/var/lib/pre-stop/.termination_marker}"

# file which the running `master` script continuously updates (mtime) to
# indicate it's still running. this file is then read by `watcher`s to
# understand if they still have to wait for `termination_marker`
heartbeat_file="${PRE_STOP_HEARTBEAT_FILE:-/var/lib/pre-stop/.heartbeat}"

# file which:
# * `watcher`s create when they bail out because they didn't see the
#   `heartbeat_file` updated within `$heartbeat_threshold` seconds;
# * `master` creates when its handler command fails;
# when scripts see such file, they also give up
bailout_file="${PRE_STOP_BAILOUT_FILE:-/var/lib/pre-stop/.bailout}"
heartbeat_threshold="${PRE_STOP_HEARTBEAT_THRESHOLD:-60}"

# where the scripts' stdout/stderr are streamed
stdout="${PRE_STOP_STDOUT:-/proc/1/fd/1}"
stderr="${PRE_STOP_STDERR:-/proc/1/fd/2}"

# command the `master` script executes, which when successfully finishes,
# causes the script to create the `marker_file`
handler="${PRE_STOP_HANDLER:-bash -c \"PYTHONUNBUFFERED=x awx-manage disable_instance --wait --retry=inf\"}"

log_prefix="${PRE_STOP_LOG_PREFIX:-preStop.exec}"
[[ -n ${PRE_STOP_LOG_ROLE} ]] && log_prefix="${log_prefix}] [$PRE_STOP_LOG_ROLE"

# interval at which `watcher`s check for `marker_file` presence
recheck_sleep="${PRE_STOP_RECHECK_SLEEP:-1}"
# interval at which `watcher`s report into $stdout that they are still watching
report_every="${PRE_STOP_REPORT_EVERY:-30}"

function log {
    printf "[%s] $1\n" "$log_prefix" "${@:2}"
}

function parameters_string {
    for param in "$@"; do
        printf "%s=\"%s\"\n" "$param" "${!param}"
    done | paste -s -d ' '
}

function check_bailout {
    if [[ -f $bailout_file ]]; then
        log "\"%s\" file has been detected, accepting bail out signal and failing the hook script" \
            "$bailout_file"
        exit 1
    fi
}

function check_heartbeat {
    # $1: epoch seconds at which the checks started; used as the reference
    # time for as long as the heartbeat file does not exist
    if [[ -f $heartbeat_file ]]; then
        delta=$(( $(date +%s) - $(stat -c %Y "$heartbeat_file") ))
    else
        delta=$(( $(date +%s) - $1 ))
    fi

    if [[ $delta -gt $heartbeat_threshold ]]; then
        log "The heartbeat file hasn't been updated for %ds, which is above the threshold of %ds, assuming the master is not operating and failing the hook script" \
            "$delta" \
            "$heartbeat_threshold"
        touch "$bailout_file"
        exit 1
    fi
}
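
The staleness test in `check_heartbeat` can be tried in isolation. The sketch below is not part of the commit: `heartbeat_is_stale` is a hypothetical helper that only returns a status instead of logging and exiting, and it assumes GNU `stat` and `touch`, as the original script does.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit). Returns 0 when FILE's mtime is more
# than THRESHOLD seconds old; if FILE is missing, FALLBACK_START (epoch
# seconds) is used as the reference time instead.
heartbeat_is_stale() {
    local file=$1 threshold=$2 fallback_start=$3 last
    if [[ -f $file ]]; then
        last=$(stat -c %Y "$file")
    else
        last=$fallback_start
    fi
    (( $(date +%s) - last > threshold ))
}

workdir=$(mktemp -d)
hb="$workdir/.heartbeat"

# A heartbeat touched just now is within the 60 s threshold.
touch "$hb"
heartbeat_is_stale "$hb" 60 "$(date +%s)" && fresh=stale || fresh=ok

# Simulate a master that went silent two minutes ago.
touch -d '2 minutes ago' "$hb"
heartbeat_is_stale "$hb" 60 "$(date +%s)" && old=stale || old=ok

echo "fresh=$fresh old=$old"
rm -rf "$workdir"
```

Separating the check from the `exit 1` side effect is only a convenience for experimenting; the script in the commit exits directly because it runs as the hook itself.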
50 changes: 50 additions & 0 deletions roles/installer/files/pre-stop/termination-master
@@ -0,0 +1,50 @@
#!/usr/bin/env bash

PRE_STOP_LOG_ROLE="${PRE_STOP_LOG_ROLE:-master}"
source "$(dirname "$0")/termination-env"

{

log "The hook has started: %s" \
    "$(parameters_string \
        "marker_file" \
        "heartbeat_file" \
        "bailout_file" \
        "handler" \
    )"

touch "$heartbeat_file"

set -o pipefail
eval "$handler" 2>&1 | while IFS= read -r line; do
    # we check the files here and break early, but overall script termination
    # happens later - as we need to distinguish between files detection and
    # command failure, while bash doesn't offer a simple way to do this here
    # inside the loop (`exit` does not terminate the script)
    [[ -f $bailout_file ]] && break
    [[ -f $marker_file ]] && break

    log "[handler] %s" "$line"
    touch "$heartbeat_file"
done
ec=$?
set +o pipefail

# process various cases in specific order
check_bailout

if [[ -f $marker_file ]]; then
    log "Done! The marker file has been detected, assuming some other instance of the script has run to completion"
    exit 0
elif [[ $ec -ne 0 ]]; then
    log "The handler has failed with \"%d\" exit code, failing the hook script too" \
        $ec
    # signal others to bail out
    touch "$bailout_file"
    exit $ec
else
    log "Done! Generating the marker file allowing to proceed to termination"
    touch "$marker_file"
fi

} > "$stdout" 2> "$stderr"
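
The master script relies on `set -o pipefail` to recover the handler's exit code from a pipeline whose last stage is a `while read` loop. A minimal sketch of that behaviour follows; `fake_handler` is a hypothetical stand-in for the real `awx-manage` command and simply prints two lines before failing with status 3.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real handler: emits output, then fails.
fake_handler() {
    echo "draining jobs"
    echo "still draining"
    return 3
}

# Without pipefail the pipeline reports the status of its last stage, the loop.
fake_handler 2>&1 | while IFS= read -r line; do
    printf '[handler] %s\n' "$line"
done
without=$?

# With pipefail a failing earlier stage is reflected in the pipeline status.
set -o pipefail
fake_handler 2>&1 | while IFS= read -r line; do
    printf '[handler] %s\n' "$line"
done
with=$?
set +o pipefail

echo "without pipefail: $without, with pipefail: $with"
```

Note that when the loop breaks early, the writing side may be terminated by `SIGPIPE` on its next write, which would also surface as a non-zero status. That appears to be why the script checks the bailout and marker files before it looks at `$ec`.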
33 changes: 33 additions & 0 deletions roles/installer/files/pre-stop/termination-waiter
@@ -0,0 +1,33 @@
#!/usr/bin/env bash

PRE_STOP_LOG_ROLE="${PRE_STOP_LOG_ROLE:-waiter}"
source "$(dirname "$0")/termination-env"

{

log "The hook has started: %s" \
    "$(parameters_string \
        "marker_file" \
        "heartbeat_file" \
        "bailout_file" \
        "recheck_sleep" \
        "report_every" \
    )"

n=0
checks_started=$(date +%s)

while ! [[ -f $marker_file ]]; do
    check_bailout
    check_heartbeat $checks_started

    if [[ $(($n % $report_every)) -eq 0 ]]; then
        log "Waiting for the marker file to be accessible..."
    fi
    n=$(($n + 1))
    sleep $recheck_sleep
done

log "The marker file found, exiting to proceed to termination"

} > "$stdout" 2> "$stderr"
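
The waiter's polling loop can be exercised without a cluster. In the sketch below a background subshell plays the role of the master and creates the marker after two seconds; the temporary directory, the timings, and the reporting interval of 5 are assumptions made for the demo, and the bailout and heartbeat checks are left out for brevity.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
marker="$workdir/.termination_marker"

# Hypothetical "master": finishes its work and drops the marker after 2 s.
( sleep 2; touch "$marker" ) &

n=0
until [[ -f $marker ]]; do
    # Report only every 5th check to keep the log quiet.
    if (( n % 5 == 0 )); then
        echo "waiting for the marker file (check #$n)"
    fi
    n=$(( n + 1 ))
    sleep 1
done
wait

[[ -f $marker ]] && found=yes || found=no
echo "marker found after $n checks"
rm -rf "$workdir"
```

A shared file on a common volume keeps the coordination between containers simple, although it does depend on every container in the Pod mounting the same directory.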
14 changes: 7 additions & 7 deletions roles/installer/tasks/install.yml
@@ -39,17 +39,17 @@
- name: Load LDAP CAcert certificate
  include_tasks: load_ldap_cacert_secret.yml
  when:
    - ldap_cacert_secret != ''

- name: Load ldap bind password
  include_tasks: load_ldap_password_secret.yml
  when:
    - ldap_password_secret != ''

- name: Load bundle certificate authority certificate
  include_tasks: load_bundle_cacert_secret.yml
  when:
    - bundle_cacert_secret != ''

- name: Include admin password configuration tasks
  include_tasks: admin_password_configuration.yml
@@ -66,8 +66,8 @@
- name: Load Route TLS certificate
  include_tasks: load_route_tls_secret.yml
  when:
    - ingress_type | lower == 'route'
    - route_tls_secret != ''

- name: Include resources configuration tasks
  include_tasks: resources_configuration.yml
@@ -91,8 +91,8 @@
      bash -c "awx-manage migrate --noinput"
  register: migrate_result
  when:
    - database_check is defined
    - (database_check.stdout|trim) != '0'

- name: Initialize Django
  include_tasks: initialize_django.yml