Refine deployment rollouts #1222

Merged
18 changes: 9 additions & 9 deletions roles/backup/tasks/creation.yml
@@ -32,22 +32,22 @@
- this_backup['resources'][0]['metadata']['labels']

- block:
- include_tasks: init.yml
- include_tasks: init.yml

- include_tasks: postgres.yml
- include_tasks: postgres.yml

- include_tasks: awx-cro.yml
- include_tasks: awx-cro.yml

- include_tasks: secrets.yml
- include_tasks: secrets.yml

- name: Set flag signifying this backup was successful
set_fact:
backup_complete: true
- name: Set flag signifying this backup was successful
set_fact:
backup_complete: true

- include_tasks: cleanup.yml
- include_tasks: cleanup.yml

when:
- this_backup['resources'][0]['status']['backupDirectory'] is not defined
- this_backup['resources'][0]['status']['backupDirectory'] is not defined

- name: Update status variables
include_tasks: update_status.yml
14 changes: 7 additions & 7 deletions roles/installer/tasks/install.yml
@@ -39,17 +39,17 @@
- name: Load LDAP CAcert certificate
include_tasks: load_ldap_cacert_secret.yml
when:
- ldap_cacert_secret != ''
- ldap_cacert_secret != ''

- name: Load ldap bind password
include_tasks: load_ldap_password_secret.yml
when:
- ldap_password_secret != ''
- ldap_password_secret != ''

- name: Load bundle certificate authority certificate
include_tasks: load_bundle_cacert_secret.yml
when:
- bundle_cacert_secret != ''
- bundle_cacert_secret != ''

- name: Include admin password configuration tasks
include_tasks: admin_password_configuration.yml
@@ -66,8 +66,8 @@
- name: Load Route TLS certificate
include_tasks: load_route_tls_secret.yml
when:
- ingress_type | lower == 'route'
- route_tls_secret != ''
- ingress_type | lower == 'route'
- route_tls_secret != ''

- name: Include resources configuration tasks
include_tasks: resources_configuration.yml
@@ -91,8 +91,8 @@
bash -c "awx-manage migrate --noinput"
register: migrate_result
when:
- database_check is defined
- (database_check.stdout|trim) != '0'
- database_check is defined
- (database_check.stdout|trim) != '0'

- name: Initialize Django
include_tasks: initialize_django.yml
90 changes: 61 additions & 29 deletions roles/installer/tasks/resources_configuration.yml
@@ -13,9 +13,17 @@
- status.phase=Running
register: tower_pod

- name: Set the resource pod as a variable.
Member

This will prevent us from grabbing a pod that is in terminating state. It also ensures that only one pod is grabbed (the oldest).

Contributor Author

Just adding the corresponding commit message in case it got missed

commit b3a7436

Make the AWX Pod variable respect `creationTimestamp` and `deletionTimestamp`

Do not consider Pods marked for deletion when calculating tower_pod. This
addresses the replica scale-down case, where the most recently spawned Pods
are normally the ones taken for removal, as well as the case where the
operator kicks off while some old replicas are still terminating.

Respect `creationTimestamp` to make sure the newest Pod is taken after a
Deployment is applied, since Pods from both the old ReplicaSet and the new
one can be running simultaneously while the rollout is happening.

I've also encountered a couple of cases where a Terminating pod was picked up, and running an awx-manage command on it during the later tasks was not possible (the pod was already gone, etc.).

Here the oldest pod is indeed taken, while later in the code the newest one (the most recently created) is used.
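A minimal sketch of how that filter chain behaves, using made-up pod data (the names and timestamps below are hypothetical, not from this PR): `rejectattr` drops any pod that already has a `deletionTimestamp`, `sort` orders the remainder by `creationTimestamp`, and `first`/`last` then pick the oldest or newest surviving pod.

```yaml
# Illustrative only: example_pods stands in for tower_pod['resources'].
- name: Illustrate pod selection with hypothetical data
  vars:
    example_pods:
      - metadata:
          name: awx-old
          creationTimestamp: "2023-02-01T10:00:00Z"
      - metadata:
          name: awx-terminating
          creationTimestamp: "2023-02-01T11:00:00Z"
          deletionTimestamp: "2023-02-01T12:00:00Z"
      - metadata:
          name: awx-new
          creationTimestamp: "2023-02-01T11:30:00Z"
  set_fact:
    # Resolves to the awx-old pod: the oldest pod that is not terminating.
    oldest_pod: >-
      {{ example_pods
         | rejectattr('metadata.deletionTimestamp', 'defined')
         | sort(attribute='metadata.creationTimestamp')
         | first | default({}) }}
    # Resolves to the awx-new pod: the newest pod that is not terminating.
    newest_pod: >-
      {{ example_pods
         | rejectattr('metadata.deletionTimestamp', 'defined')
         | sort(attribute='metadata.creationTimestamp')
         | last | default({}) }}
```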

set_fact:
tower_pod: >-
{{ tower_pod['resources']
| rejectattr('metadata.deletionTimestamp', 'defined')
| sort(attribute='metadata.creationTimestamp')
| first | default({}) }}

- name: Set the resource pod name as a variable.
set_fact:
tower_pod_name: "{{ tower_pod['resources'][0]['metadata']['name'] | default('') }}"
tower_pod_name: "{{ tower_pod['metadata']['name'] | default('') }}"

- name: Set user provided control plane ee image
set_fact:
@@ -32,13 +40,13 @@
kind: Secret
namespace: '{{ ansible_operator_meta.namespace }}'
name: '{{ ansible_operator_meta.name }}-receptor-ca'
register: _receptor_ca
register: receptor_ca
no_log: "{{ no_log }}"

- name: Migrate Receptor CA Secret
when:
- _receptor_ca['resources'] | default([]) | length
- _receptor_ca['resources'][0]['type'] != "kubernetes.io/tls"
- receptor_ca['resources'] | default([]) | length
- receptor_ca['resources'][0]['type'] != "kubernetes.io/tls"
block:
- name: Delete old Receptor CA Secret
k8s:
@@ -53,7 +61,7 @@
register: _receptor_ca_key_file
- name: Copy Receptor CA key from old secret to tempfile
copy:
content: "{{ _receptor_ca['resources'][0]['data']['receptor-ca.key'] | b64decode }}"
content: "{{ receptor_ca['resources'][0]['data']['receptor-ca.key'] | b64decode }}"
dest: "{{ _receptor_ca_key_file.path }}"
no_log: "{{ no_log }}"
- name: Create tempfile for receptor-ca.crt
@@ -63,14 +71,25 @@
register: _receptor_ca_crt_file
- name: Copy Receptor CA cert from old secret to tempfile
copy:
content: "{{ _receptor_ca['resources'][0]['data']['receptor-ca.crt'] | b64decode }}"
content: "{{ receptor_ca['resources'][0]['data']['receptor-ca.crt'] | b64decode }}"
dest: "{{ _receptor_ca_crt_file.path }}"
no_log: "{{ no_log }}"
- name: Create New Receptor CA secret
k8s:
apply: true
definition: "{{ lookup('template', 'secrets/receptor_ca_secret.yaml.j2') }}"
no_log: "{{ no_log }}"
- name: Read New Receptor CA Secret
k8s_info:
kind: Secret
namespace: '{{ ansible_operator_meta.namespace }}'
name: '{{ ansible_operator_meta.name }}-receptor-ca'
register: _receptor_ca
no_log: "{{ no_log }}"
- name: Set receptor_ca variable
set_fact:
receptor_ca: '{{ _receptor_ca }}'
no_log: "{{ no_log }}"
- name: Remove tempfiles
file:
path: "{{ item }}"
@@ -106,21 +125,32 @@
apply: true
definition: "{{ lookup('template', 'secrets/receptor_ca_secret.yaml.j2') }}"
no_log: "{{ no_log }}"
- name: Read Receptor CA secret
k8s_info:
kind: Secret
namespace: '{{ ansible_operator_meta.namespace }}'
name: '{{ ansible_operator_meta.name }}-receptor-ca'
register: _receptor_ca
no_log: "{{ no_log }}"
- name: Set receptor_ca variable
set_fact:
receptor_ca: '{{ _receptor_ca }}'
no_log: "{{ no_log }}"
- name: Remove tempfiles
file:
path: "{{ item }}"
state: absent
loop:
- "{{ _receptor_ca_key_file.path }}"
- "{{ _receptor_ca_crt_file.path }}"
when: not _receptor_ca['resources'] | default([]) | length
when: not receptor_ca['resources'] | default([]) | length

- name: Check for Receptor work signing Secret
k8s_info:
kind: Secret
namespace: '{{ ansible_operator_meta.namespace }}'
name: '{{ ansible_operator_meta.name }}-receptor-work-signing'
register: _receptor_work_signing
register: receptor_work_signing
no_log: "{{ no_log }}"

- name: Generate Receptor work signing RSA key pair
@@ -151,21 +181,31 @@
apply: true
definition: "{{ lookup('template', 'secrets/receptor_work_signing_secret.yaml.j2') }}"
no_log: "{{ no_log }}"
- name: Read Receptor work signing Secret
k8s_info:
kind: Secret
namespace: '{{ ansible_operator_meta.namespace }}'
name: '{{ ansible_operator_meta.name }}-receptor-work-signing'
register: _receptor_work_signing
no_log: "{{ no_log }}"
- name: Set receptor_work_signing variable
set_fact:
receptor_work_signing: '{{ _receptor_work_signing }}'
no_log: "{{ no_log }}"
- name: Remove tempfiles
file:
path: "{{ item }}"
state: absent
loop:
- "{{ _receptor_work_signing_private_key_file.path }}"
- "{{ _receptor_work_signing_public_key_file.path }}"
when: not _receptor_work_signing['resources'] | default([]) | length
when: not receptor_work_signing['resources'] | default([]) | length

- name: Apply Resources
k8s:
apply: yes
definition: "{{ lookup('template', item + '.yaml.j2') }}"
wait: yes
register: tower_resources_result
loop:
- 'configmaps/config'
- 'secrets/app_credentials'
@@ -210,21 +250,10 @@
apply: yes
definition: "{{ lookup('template', 'deployments/deployment.yaml.j2') }}"
wait: yes
wait_timeout: "{{ 120 * replicas or 120 }}"
Member

@stanislav-zaprudskiy could you expand on why you added this?

Contributor Author

Adding the corresponding commit message just in case it got missed

commit e589ceb

When applying the Deployment, wait up to (timeout * replicas)

There are cases when rolling out a new Deployment takes longer than the
default timeout of 120s. For instance, when a Deployment has multiple
replicas, each replica starts on a separate node, and the Deployment
specifies new images, just pulling those images for each replica can take
longer than the default 120s.

Multiplying the default time by the number of replicas should generally
give all replicas enough time to start.

The corresponding parameter wait: yes (already provided) causes the task to wait until the Deployment's pods are ready. With the default 120 seconds it could fail, depending on the overall k8s and AWX configuration, causing the operator run to fail and be started again. Starting a new run won't solve the problem, however, as it would hit the same timeout again and again until the new pods are ready. Increasing the maximum waiting time up front avoids those repeated failures and restarts.

Just to add, the Deployment's rollout strategy configuration can also increase the time it takes for the new pods to become ready.

There could be cases where scheduling new pods is simply not possible (e.g. due to a lack of resources, a wrong image configuration, etc.), and the operator would then be stuck waiting unnecessarily. A lower timeout value would surface the problem to users earlier, but in multi-replica AWX configurations running on multi-node clusters, where image caches aren't generally available, lower timeout values produce too many false-positive failures of operator runs.
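As a small sketch of the arithmetic (the replica counts below are hypothetical): in Jinja2, `x or y` evaluates to `y` when `x` is falsy, so a replicas value of 0 falls back to the plain 120-second default.

```yaml
# Illustrative only: prints the effective wait_timeout for a few replica counts.
- name: Show the effective wait_timeout per replica count
  debug:
    msg: "replicas={{ item }} -> wait_timeout={{ 120 * item or 120 }}s"
  loop: [0, 1, 3]  # yields 120s, 120s, and 360s respectively
```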

register: this_deployment_result

- block:
- name: Delete pod to reload a resource configuration
k8s:
api_version: v1
state: absent
kind: Pod
namespace: '{{ ansible_operator_meta.namespace }}'
name: '{{ tower_pod_name }}'
wait: yes
when:
- tower_resources_result.changed
- tower_pod_name | length

- name: Get the new resource pod information after updating resource.
k8s_info:
kind: Pod
@@ -236,17 +265,20 @@
field_selectors:
- status.phase=Running
register: _new_pod
until:
- _new_pod['resources'] | length
- _new_pod['resources'][0]['metadata']['name'] != tower_pod_name
delay: 5
retries: 60

- name: Update new resource pod as a variable.
set_fact:
tower_pod: >-
{{ _new_pod['resources']
| rejectattr('metadata.deletionTimestamp', 'defined')
| sort(attribute='metadata.creationTimestamp')
| last | default({}) }}

- name: Update new resource pod name as a variable.
set_fact:
tower_pod_name: '{{ _new_pod["resources"][0]["metadata"]["name"] }}'
tower_pod_name: '{{ tower_pod["metadata"]["name"] | default("")}}'
when:
- tower_resources_result.changed or this_deployment_result.changed
- this_deployment_result.changed

- name: Verify the resource pod name is populated.
assert:
4 changes: 2 additions & 2 deletions roles/installer/tasks/secret_key_configuration.yml
@@ -40,10 +40,10 @@

- name: Set secret key secret
set_fact:
__secret_key_secret: '{{ _generated_secret_key["resources"] | default([]) | length | ternary(_generated_secret_key, _secret_key_secret) }}'
secret_key: '{{ _generated_secret_key["resources"] | default([]) | length | ternary(_generated_secret_key, _secret_key_secret) }}'
Member

I want to do some testing around this before merging. cc @TheRealHaoLiu

no_log: "{{ no_log }}"

- name: Store secret key secret name
set_fact:
secret_key_secret_name: "{{ __secret_key_secret['resources'][0]['metadata']['name'] }}"
secret_key_secret_name: "{{ secret_key['resources'][0]['metadata']['name'] }}"
no_log: "{{ no_log }}"
19 changes: 18 additions & 1 deletion roles/installer/templates/deployments/deployment.yaml.j2
@@ -20,8 +20,25 @@ spec:
labels:
{{ lookup("template", "../common/templates/labels/common.yaml.j2") | indent(width=8) | trim }}
{{ lookup("template", "../common/templates/labels/version.yaml.j2") | indent(width=8) | trim }}
{% if annotations %}
annotations:
{% for template in [
"configmaps/config",
"secrets/app_credentials",
"storage/persistent",
] %}
checksum-{{ template | replace('/', '-') }}: "{{ lookup('template', template + '.yaml.j2') | md5 }}"
Member
@rooftopcellist Feb 28, 2023

❤️ This is a great trick! This is an elegant solution to the issue of deployments not being cycled when changes are made to the ConfigMap. @TheRealHaoLiu is impressed too.

{% endfor %}
{% for secret in [
"bundle_cacert",
"route_tls",
"ldap_cacert",
"secret_key",
"receptor_ca",
"receptor_work_signing",
] %}
checksum-secret-{{ secret }}: "{{ lookup('ansible.builtin.vars', secret, default='')["resources"][0]["data"] | default('') | md5 }}"
{% endfor %}
{% if annotations %}
{{ annotations | indent(width=8) }}
{% endif %}
spec:
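Because the checksum annotations rendered by the loops above sit under the Deployment's pod template metadata, any change to one of the watched ConfigMaps or Secrets changes an annotation value, which changes the pod template and causes Kubernetes to roll the Deployment. A sketch of what the rendered output could look like (the hash values below are made up for illustration):

```yaml
# Hypothetical rendered pod-template annotations; hashes are not real.
spec:
  template:
    metadata:
      annotations:
        checksum-configmaps-config: "9f86d081884c7d659a2feaa0c55ad015"
        checksum-secrets-app_credentials: "2c26b46b68ffc68ff99b453c1d304134"
        checksum-secret-secret_key: "fcde2b2edba56bf408601fb721fe9b5c"
        checksum-secret-receptor_ca: "6b86b273ff34fce19d6b804eff5a3f57"
```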
20 changes: 10 additions & 10 deletions roles/restore/tasks/main.yml
@@ -32,24 +32,24 @@
- this_restore['resources'][0]['metadata']['labels']

- block:
- include_tasks: init.yml
- include_tasks: init.yml

- include_tasks: import_vars.yml
- include_tasks: import_vars.yml

- include_tasks: secrets.yml
- include_tasks: secrets.yml

- include_tasks: deploy_awx.yml
- include_tasks: deploy_awx.yml

- include_tasks: postgres.yml
- include_tasks: postgres.yml

- name: Set flag signifying this restore was successful
set_fact:
tower_restore_complete: True
- name: Set flag signifying this restore was successful
set_fact:
tower_restore_complete: True

- include_tasks: cleanup.yml
- include_tasks: cleanup.yml

when:
- this_restore['resources'][0]['status']['restoreComplete'] is not defined
- this_restore['resources'][0]['status']['restoreComplete'] is not defined

- name: Update status variables
include_tasks: update_status.yml