
Upgrade and rollback Improvements (#608)
vitabaks authored Mar 26, 2024
1 parent ceb630c commit 0a09ef1
Showing 9 changed files with 103 additions and 14 deletions.
13 changes: 8 additions & 5 deletions roles/upgrade/README.md
````diff
@@ -46,6 +46,8 @@ On average, the PgBouncer pause duration is approximately 30 seconds. However, f
 
 This playbook performs a rollback of a PostgreSQL upgrade.
 
+Note: In some scenarios, if errors occur, the pg_upgrade.yml playbook may automatically initiate a rollback. Alternatively, if the automatic rollback does not occur, you can manually execute the pg_upgrade_rollback.yml playbook to revert the changes.
+
 ```bash
 ansible-playbook pg_upgrade_rollback.yml
 ```
@@ -182,7 +184,8 @@ Please see the variable file vars/[upgrade.yml](../../vars/upgrade.yml)
 - Print the result of the pg_upgrade check
 
 #### 4. PRE-UPGRADE: Prepare the Patroni configuration
-- Edit patroni.yml
+- Backup the patroni.yml configuration file
+- Edit the patroni.yml configuration file
 - **Update parameters**: `data_dir`, `bin_dir`, `config_dir`
 - **Check if the 'standby_cluster' parameter is specified**
 - Remove parameters: `standby_cluster` (if exists)
@@ -226,7 +229,7 @@ Please see the variable file vars/[upgrade.yml](../../vars/upgrade.yml)
 - Notes: max wait time: 2 minutes
 - Stop, if replication lag is high
 - Perform rollback
-- Print error message: "There's a replication lag in the PostgreSQL Cluster. Please try again later"
+- Print error message: "There's a replication lag in the PostgreSQL Cluster. Please try again later"
 - **Perform PAUSE on all PgBouncer servers**
 - Notes: if 'pgbouncer_install' is 'true' and 'pgbouncer_pool_pause' is 'true'
 - Notes: pgbouncer pause script (details in [pgbouncer_pause.yml](tasks/pgbouncer_pause.yml)) performs the following actions:
@@ -236,7 +239,7 @@ Please see the variable file vars/[upgrade.yml](../../vars/upgrade.yml)
 - If active queries do not complete within 30 seconds (`pgbouncer_pool_pause_terminate_after` variable), the script terminates slow active queries (longer than `pg_slow_active_query_treshold_to_terminate`).
 - If after that it is still not possible to pause the pgbouncer servers within 60 seconds (`pgbouncer_pool_pause_stop_after` variable) from the start of the script, the script exits with an error.
 - Perform rollback
-- Print error message: "PgBouncer pools could not be paused, please try again later."
+- Print error message: "PgBouncer pools could not be paused, please try again later."
 - **Stop PostgreSQL** on the Leader and Replicas
 - Check if old PostgreSQL is stopped
 - Check if new PostgreSQL is stopped
@@ -248,9 +251,9 @@ Please see the variable file vars/[upgrade.yml](../../vars/upgrade.yml)
 - "'Latest checkpoint location' is the same on the leader and its standbys"
 - if 'Latest checkpoint location' values don't match
 - Perform rollback
-- Stop with error message:
-- "Latest checkpoint location' doesn't match on leader and its standbys. Please try again later"
+- Print error message: "'Latest checkpoint location' doesn't match on leader and its standbys. Please try again later"
 - **Upgrade the PostgreSQL on the Primary** (using pg_upgrade --link)
+- Perform rollback, if the upgrade failed
 - Print the result of the pg_upgrade
 - **Make sure that the new data directory is empty on the Replica**
 - **Upgrade the PostgreSQL on the Replica** (using rsync --hard-links)
````
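The README now says that `pg_upgrade.yml` may start a rollback on its own when errors occur. In this commit the trigger is a conditional block that runs `rollback.yml` when the registered `pg_upgrade_result` has failed. For comparison, Ansible's `block`/`rescue` is another common way to wire up an automatic rollback. The sketch below is illustrative only and is not the playbook's actual structure; `rollback.yml` comes from this commit, while `run_pg_upgrade.yml` is a hypothetical task file:

```yaml
# Illustrative sketch, not the repository's implementation.
- name: Upgrade the Primary with an automatic rollback on failure
  block:
    - name: Run pg_upgrade on the Primary
      ansible.builtin.include_tasks: run_pg_upgrade.yml  # hypothetical task file
  rescue:
    # Tasks under "rescue" run only if a task in the block above failed.
    - name: Perform rollback
      ansible.builtin.include_tasks: rollback.yml

    - name: Stop the playbook after the rollback
      ansible.builtin.fail:
        msg: "The PostgreSQL upgrade failed and a rollback was performed."
```

One difference to keep in mind: `rescue` only runs for a failure that is not suppressed, so the failing task must not use `ignore_errors: true`. The commit takes the opposite route: it sets `ignore_errors: true` on the `pg_upgrade` task and then tests `pg_upgrade_result is failed` explicitly, which lets the check read the Primary's result through `hostvars` regardless of which host evaluates it.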
5 changes: 5 additions & 0 deletions roles/upgrade/tasks/extensions.yml
```diff
@@ -5,7 +5,11 @@
 {{ pg_new_bindir }}/psql -p {{ postgresql_port }} -U {{ patroni_superuser_username }} -d postgres -tAXc
 "select datname from pg_catalog.pg_database where datname <> 'template0'"
 register: databases_list
+until: databases_list is success
+delay: 5
+retries: 3
 changed_when: false
+ignore_errors: true # show the error and continue the playbook execution
 when:
 - inventory_hostname in groups['primary']
 
@@ -15,6 +19,7 @@
 loop_control:
 loop_var: pg_target_dbname
 when:
+- databases_list is success
 - databases_list.stdout_lines is defined
 - databases_list.stdout_lines | length > 0
 
```
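Because the database-list query now uses `ignore_errors: true`, every task that consumes `databases_list` has to cope with a failed or missing result, which is why the loop above gained a `databases_list is success` condition. A hedged sketch of the same idea using a hypothetical `debug` task (only the variable names come from `extensions.yml`): adding `default([])` to the loop expression keeps it valid on hosts where the query task was skipped, because a skipped result has no `stdout_lines` key.

```yaml
# Hypothetical consumer task; only the variable names come from extensions.yml.
- name: Visit each database only if the list was retrieved
  ansible.builtin.debug:
    msg: "Would check extensions in database {{ pg_target_dbname }}"
  loop: "{{ databases_list.stdout_lines | default([]) }}"
  loop_control:
    loop_var: pg_target_dbname
  when:
    - inventory_hostname in groups['primary']
    - databases_list is success
```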
9 changes: 9 additions & 0 deletions roles/upgrade/tasks/post_checks.yml
```diff
@@ -31,6 +31,11 @@
 {{ pg_new_bindir }}/psql -p {{ postgresql_port }} -U {{ patroni_superuser_username }} -d postgres -tAXc
 "drop table IF EXISTS test_replication;
 create table test_replication as select generate_series(1, 10000)"
+register: create_table_result
+until: create_table_result is success
+delay: 5
+retries: 3
+ignore_errors: true # show the error and continue the playbook execution
 when:
 - inventory_hostname in groups['primary']
 
@@ -46,13 +51,15 @@
 failed_when: false
 when:
 - inventory_hostname in groups['secondary']
+- create_table_result is success
 
 - name: Drop a table "test_replication"
 ansible.builtin.command: >-
 {{ pg_new_bindir }}/psql -p {{ postgresql_port }} -U {{ patroni_superuser_username }} -d postgres -tAXc
 "drop table IF EXISTS test_replication"
 when:
 - inventory_hostname in groups['primary']
+- create_table_result is success
 
 - name: Print the result of checking the number of records
 ansible.builtin.debug:
@@ -61,6 +68,7 @@
 - "The number of records in the test_replication table the same as the Primary ({{ count_test.stdout }} rows)"
 when:
 - inventory_hostname in groups['secondary']
+- count_test.stdout is defined
 - count_test.stdout | int == 10000
 
 # Error, if the number of records in the "test_replication" table does not match the Primary.
@@ -74,6 +82,7 @@
 ignore_errors: true # show the error and continue the playbook execution
 when:
 - inventory_hostname in groups['secondary']
+- count_test.stdout is defined
 - count_test.stdout | int != 10000
 
 ...
```
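One subtlety in the post-check conditions: `create_table_result` is registered by a task that runs only on the Primary, so on replica hosts it holds a skipped result. Ansible's `success` test reports a skipped result as successful, which means a `create_table_result is success` condition evaluated on a replica does not by itself show whether the table was created on the Primary. A sketch that reads the Primary's registered result through `hostvars`, the same approach `upgrade_primary.yml` uses for `pg_upgrade_result`; the `count(*)` query is an assumption about what the replica-side count task runs, since its command is not shown in this diff:

```yaml
# Sketch: gate the replica-side check on the result registered on the Primary.
- name: Check the number of records in the test_replication table
  ansible.builtin.command: >-
    {{ pg_new_bindir }}/psql -p {{ postgresql_port }} -U {{ patroni_superuser_username }} -d postgres -tAXc
    "select count(*) from test_replication"
  register: count_test
  changed_when: false
  failed_when: false
  when:
    - inventory_hostname in groups['secondary']
    # This host's own create_table_result is a skipped (and therefore "successful")
    # result, so look up the one registered on the Primary instead.
    - hostvars[groups['primary'][0]].create_table_result is success
```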
25 changes: 18 additions & 7 deletions roles/upgrade/tasks/post_upgrade.yml
```diff
@@ -4,15 +4,20 @@
 ansible.builtin.command: >-
 {{ pg_new_bindir }}/psql -p {{ postgresql_port }} -U {{ patroni_superuser_username }} -d postgres -tAXc
 "show data_directory"
-changed_when: false
 register: pg_current_datadir
+until: pg_current_datadir is success
+delay: 5
+retries: 3
+changed_when: false
+ignore_errors: true # show the error and continue the playbook execution
 
 # RedHat based
 - name: Delete the old PostgreSQL data directory
 ansible.builtin.file:
 path: "{{ pg_old_datadir }}"
 state: absent
 when:
+- pg_current_datadir is success
 - pg_new_datadir == pg_current_datadir.stdout | trim
 - ansible_os_family == "RedHat"
 
@@ -22,6 +27,7 @@
 /usr/bin/pg_dropcluster {{ pg_old_version }} {{ postgresql_cluster_name }}
 failed_when: false
 when:
+- pg_current_datadir is success
 - pg_new_datadir == pg_current_datadir.stdout | trim
 - ansible_os_family == "Debian"
 
@@ -31,6 +37,7 @@
 path: "{{ postgresql_wal_dir | regex_replace('(/$)', '') | regex_replace(postgresql_version, pg_old_version) }}"
 state: absent
 when:
+- pg_current_datadir is success
 - postgresql_wal_dir | length > 0
 - pg_new_wal_dir | length > 0
 
@@ -46,7 +53,7 @@
 until: package_remove is success
 delay: 5
 retries: 3
-ignore_errors: true
+ignore_errors: true # show the error and continue the playbook execution
 when:
 - item is search(pg_old_version)
 - pg_old_packages_remove | bool
@@ -65,7 +72,7 @@
 until: apt_remove is success
 delay: 5
 retries: 3
-ignore_errors: true
+ignore_errors: true # show the error and continue the playbook execution
 when:
 - item is search(pg_old_version)
 - pg_old_packages_remove | bool
@@ -81,6 +88,7 @@
 
 - name: Update the PostgreSQL configuration
 ansible.builtin.command: "{{ pg_new_bindir }}/pg_ctl reload -D {{ pg_new_datadir }}"
+ignore_errors: true # show the error and continue the playbook execution
 when:
 - socket_access_result.stderr is defined
 - "'no pg_hba.conf entry' in socket_access_result.stderr"
@@ -108,7 +116,7 @@
 when: pg_path_count.stdout | length > 0 and pgbackrest_stanza_upgrade | bool
 become: true
 become_user: postgres
-ignore_errors: true
+ignore_errors: true # show the error and continue the playbook execution
 when:
 - pgbackrest_install | bool
 - pgbackrest_repo_host | length < 1
@@ -142,7 +150,7 @@
 when: pg_path_count.stdout | length > 0 and pgbackrest_stanza_upgrade | bool
 become: true
 become_user: "{{ pgbackrest_repo_user }}"
-ignore_errors: true
+ignore_errors: true # show the error and continue the playbook execution
 when:
 - pgbackrest_install | bool
 - pgbackrest_repo_host | length > 0
@@ -162,7 +170,7 @@
 replace: "{{ postgresql_data_dir | regex_replace(postgresql_version, pg_new_version) }}"
 become: true
 become_user: root
-ignore_errors: true
+ignore_errors: true # show the error and continue the playbook execution
 when: wal_g_install | bool
 
 # Wait for the analyze to complete
@@ -190,7 +198,7 @@
 done < /tmp/pg_terminator.pid
 args:
 executable: /bin/bash
-ignore_errors: true
+ignore_errors: true # show the error and continue the playbook execution
 when: (pg_terminator_analyze is defined and pg_terminator_analyze is changed) or
 (pg_terminator_long_transactions is defined and pg_terminator_long_transactions is changed)
 
@@ -212,6 +220,9 @@
 {{ pg_new_bindir }}/psql -p {{ postgresql_port }} -U {{ patroni_superuser_username }} -d postgres -tAXc
 "select current_setting('server_version')"
 register: postgres_version
+until: postgres_version is success
+delay: 5
+retries: 3
 changed_when: false
 when: inventory_hostname in groups['primary']
 
```
17 changes: 17 additions & 0 deletions roles/upgrade/tasks/pre_checks.yml
```diff
@@ -278,13 +278,30 @@
 vars:
 ssh_key_user: postgres
 
+# if pg_new_wal_dir is defined (for synchronize wal dir)
+- name: '[Pre-Check] Make sure that the sshpass package are installed'
+become: true
+become_user: root
+ansible.builtin.package:
+name: sshpass
+state: present
+register: package_status
+until: package_status is success
+delay: 5
+retries: 3
+when: pg_new_wal_dir | length > 0
+
 # Rsync Checks
 - name: '[Pre-Check] Make sure that the rsync package are installed'
 become: true
 become_user: root
 ansible.builtin.package:
 name: rsync
 state: present
+register: package_status
+until: package_status is success
+delay: 5
+retries: 3
 
 - name: '[Pre-Check] Rsync Checks: create testrsync file on Primary'
 become: true
```
13 changes: 11 additions & 2 deletions roles/upgrade/tasks/rollback.yml
```diff
@@ -85,8 +85,17 @@
 - inventory_hostname in groups['primary']
 - pg_control_version.stdout == pg_new_version | replace('.', '')
 
-# Revert the paths to the old PostgreSQL
-- name: '[Rollback] Revert the paths to the old PostgreSQL in patroni.yml'
+# Restore the old Patroni configuration
+- name: '[Rollback] Restore the old patroni.yml configuration file'
+ansible.builtin.copy:
+src: "{{ patroni_config_file }}.bkp"
+dest: "{{ patroni_config_file }}"
+owner: postgres
+group: postgres
+mode: "0640"
+remote_src: true
+
+- name: '[Rollback] Ensure old PostgreSQL paths are set in patroni.yml'
 ansible.builtin.replace:
 path: "{{ patroni_config_file }}"
 regexp: "{{ item.regexp }}"
```
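The rollback now restores `patroni.yml` from the `.bkp` copy made in `update_config.yml`. If a rollback can run before that backup has been written (for example, when an earlier step fails), a `copy` with `remote_src: true` fails because its source file is missing. A hedged sketch that checks for the backup first; `patroni_config_backup` is a hypothetical variable name:

```yaml
# Sketch: restore patroni.yml only when the backup file actually exists.
- name: '[Rollback] Check whether the patroni.yml backup exists'
  ansible.builtin.stat:
    path: "{{ patroni_config_file }}.bkp"
  register: patroni_config_backup

- name: '[Rollback] Restore the old patroni.yml configuration file'
  ansible.builtin.copy:
    src: "{{ patroni_config_file }}.bkp"
    dest: "{{ patroni_config_file }}"
    owner: postgres
    group: postgres
    mode: "0640"
    remote_src: true
  when: patroni_config_backup.stat.exists
```

When the backup is absent, the existing `Ensure old PostgreSQL paths are set in patroni.yml` task still attempts to revert the directory paths.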
7 changes: 7 additions & 0 deletions roles/upgrade/tasks/update_config.yml
```diff
@@ -1,5 +1,12 @@
 ---
 # Prepare the parameters for Patroni.
+
+- name: "Backup the patroni.yml configuration file"
+ansible.builtin.copy:
+src: "{{ patroni_config_file }}"
+dest: "{{ patroni_config_file }}.bkp"
+remote_src: true
+
 # Update the directory path to a new version of PostgresSQL
 - name: "Edit patroni.yml | update parameters: data_dir, bin_dir, config_dir"
 ansible.builtin.replace:
```
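The backup is written with a plain `copy` and `remote_src: true`, so every run overwrites `patroni.yml.bkp` with whatever `patroni.yml` contains at that moment. If the playbook is re-run after an interrupted run has already edited the file, the backup would then hold the edited version. One possible variant, sketched below and not the committed task, sets `force: false` so an existing backup is kept and `mode: preserve` so the copy gets the source file's permissions. Which backup to keep is a judgment call: with `force: false`, a leftover `.bkp` from an earlier, unrelated upgrade would also be kept.

```yaml
# Sketch of a variant, not the committed task.
- name: "Backup the patroni.yml configuration file"
  ansible.builtin.copy:
    src: "{{ patroni_config_file }}"
    dest: "{{ patroni_config_file }}.bkp"
    remote_src: true
    force: false      # keep an existing backup instead of overwriting it
    mode: preserve    # give the backup the same permissions as patroni.yml
```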
14 changes: 14 additions & 0 deletions roles/upgrade/tasks/update_extensions.yml
```diff
@@ -5,7 +5,11 @@
 {{ pg_new_bindir }}/psql -p {{ postgresql_port }} -U {{ patroni_superuser_username }} -d {{ pg_target_dbname }} -tAXc
 "select extname from pg_catalog.pg_extension"
 register: pg_installed_extensions
+until: pg_installed_extensions is success
+delay: 5
+retries: 3
 changed_when: false
+ignore_errors: true # show the error and continue the playbook execution
 when:
 - inventory_hostname in groups['primary']
 
@@ -16,7 +20,11 @@
 join pg_catalog.pg_available_extensions ae on extname = ae.name
 where installed_version <> default_version"
 register: pg_old_extensions
+until: pg_old_extensions is success
+delay: 5
+retries: 3
 changed_when: false
+ignore_errors: true # show the error and continue the playbook execution
 when:
 - inventory_hostname in groups['primary']
 
@@ -28,6 +36,7 @@
 - "No update is required."
 when:
 - inventory_hostname in groups['primary']
+- pg_old_extensions is success
 - pg_old_extensions.stdout_lines | length < 1
 
 # if pg_stat_cache is not installed
@@ -40,6 +49,8 @@
 loop: "{{ pg_old_extensions.stdout_lines | reject('match', '^pg_repack$') | list }}"
 when:
 - inventory_hostname in groups['primary']
+- pg_old_extensions is success
+- pg_installed_extensions is success
 - pg_old_extensions.stdout_lines | length > 0
 - (not 'pg_stat_kcache' in pg_installed_extensions.stdout_lines)
 
@@ -63,6 +74,8 @@
 ignore_errors: true # show the error and continue the playbook execution
 when:
 - inventory_hostname in groups['primary']
+- pg_old_extensions is success
+- pg_installed_extensions is success
 - pg_old_extensions.stdout_lines | length > 0
 - ('pg_stat_statements' in pg_old_extensions.stdout_lines or 'pg_stat_kcache' in pg_old_extensions.stdout_lines)
 - ('pg_stat_kcache' in pg_installed_extensions.stdout_lines)
@@ -76,6 +89,7 @@
 ignore_errors: true # show the error and continue the playbook execution
 when:
 - inventory_hostname in groups['primary']
+- pg_old_extensions is success
 - (pg_old_extensions.stdout_lines | length > 0 and 'pg_repack' in pg_old_extensions.stdout_lines)
 
 ...
```
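Every task that reads `pg_old_extensions` or `pg_installed_extensions` is now also gated on `... is success`, since both queries use `ignore_errors: true`. A hedged sketch of an update loop written in that style: `ALTER EXTENSION ... UPDATE` is standard PostgreSQL, but this is not the role's actual task (its command is not shown in this diff). It skips `pg_repack` the same way the loop above does and leaves out the role's separate handling of `pg_stat_statements` and `pg_stat_kcache`.

```yaml
# Sketch only; the role's real update command is outside this diff.
- name: Update outdated extensions
  ansible.builtin.command: >-
    {{ pg_new_bindir }}/psql -p {{ postgresql_port }} -U {{ patroni_superuser_username }} -d {{ pg_target_dbname }} -tAXc
    'alter extension "{{ item }}" update'
  loop: "{{ pg_old_extensions.stdout_lines | default([]) | reject('match', '^pg_repack$') | list }}"
  ignore_errors: true   # report a failed update and move on to the next extension
  when:
    - inventory_hostname in groups['primary']
    - pg_old_extensions is success
```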
14 changes: 14 additions & 0 deletions roles/upgrade/tasks/upgrade_primary.yml
```diff
@@ -21,9 +21,23 @@
 shared_preload_libraries: "-c shared_preload_libraries='{{ pg_shared_preload_libraries_value }}'"
 timescaledb_restoring: "{{ \"-c timescaledb.restoring='on'\" if 'timescaledb' in pg_shared_preload_libraries_value else '' }}"
 register: pg_upgrade_result
+ignore_errors: true # show the error and perform rollback
 when:
 - inventory_hostname in groups['primary']
 
+# Stop, if the upgrade failed
+- block:
+- name: Perform rollback
+ansible.builtin.include_tasks: rollback.yml
+
+- name: "ERROR: PostgreSQL upgrade failed"
+ansible.builtin.fail:
+msg:
+- "The PostgreSQL upgrade has encountered an error and a rollback has been initiated."
+- "For detailed information, please consult the pg_upgrade log located at '{{ pg_new_datadir }}/pg_upgrade_output.d'"
+run_once: true
+when: hostvars[groups['primary'][0]].pg_upgrade_result is failed
+
 # If the length of the pg_upgrade_result.stdout_lines is greater than 100 lines,
 # the upgrade_output variable will include the first 70 lines, an ellipsis (...),
 # and the last 30 lines of the pg_upgrade_result.stdout_lines.
```
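The comment at the end of this hunk describes how `upgrade_output` is built, but the implementation itself lies outside the diff. One way to express that rule in Jinja is sketched below with `set_fact`; this is an assumption about the approach, not the role's code. With more than 100 lines of output the result has 70 + 1 + 30 = 101 entries: the first 70 lines, a `...` marker, and the last 30 lines.

```yaml
# Sketch: keep short output as-is, otherwise show the first 70 and last 30 lines.
- name: Shorten the pg_upgrade output for display
  ansible.builtin.set_fact:
    upgrade_output: "{{ (pg_upgrade_result.stdout_lines[:70] + ['...'] + pg_upgrade_result.stdout_lines[-30:]) if (pg_upgrade_result.stdout_lines | length > 100) else pg_upgrade_result.stdout_lines }}"
  when: inventory_hostname in groups['primary']

- name: Print the result of the pg_upgrade
  ansible.builtin.debug:
    msg: "{{ upgrade_output }}"
  when: inventory_hostname in groups['primary']
```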
