Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pgBackRest: Point-In-Time Recovery (PITR) Improvements #765

Merged
merged 29 commits into from
Nov 12, 2024

Conversation

vitabaks
Copy link
Owner

@vitabaks vitabaks commented Sep 20, 2024

This PR introduces a configurable cluster_restore_timeout variable to replace hardcoded values and provide flexibility. The timeout has been increased to 24 hours, making it more suitable for restoring large databases, including both backup and WAL recovery.

Additionally, we've added the task "Wait for the Standby cluster initialization to complete". This task allows tracking the progress of Standby cluster initialization in the Ansible logs. The same 24-hour timeout is applied to this task to ensure it has ample time to complete for large databases.

Variables:

  • cluster_restore_timeout - backup and WAL restore timeout in seconds, default: 86400 (24 hours)
  • pg_ctl_timeout - pg_ctl start and stop timeout, default: 3600 (1 hour)

Additionally

  • Set the restore command for replicas according to the value of the pgbackrest_patroni_cluster_restore_command variable.
  • Remove duplicate pgbackrest vars from the inventory.
  • Check that PostgreSQL is stopped before running the restore command.
  • Update task "Start PostgreSQL for Recovery".
    • Add the --config-file option.
    • Add the restore_command option yo be able to restore WAL from the archive even if the postgresql_restore_command variable is not set.
    • Add the archive_command option to eliminate the risk of archiving WAL immediately after the promotion. It makes sense for cloning when we don't want to send the WAL to the main cluster storage.
    • Run the startup command asynchronously and add the task "Wait for the PostgreSQL start command to complete".
  • Improve the reliability of the verification of the achievement of the recovery target.
  • Get and print the PostgreSQL recovery log details.

Add patroni_cluster_restore_timeout variable
@vitabaks vitabaks self-assigned this Sep 20, 2024
@vitabaks vitabaks changed the title Add cluster restore timeout Add cluster restore timeout and Standby cluster initialization check Sep 20, 2024
@vitabaks vitabaks changed the title Add cluster restore timeout and Standby cluster initialization check Add restore timeout and Standby Cluster initialization check Sep 20, 2024
@vitabaks vitabaks added the enhancement Improvement of the current functionality label Sep 20, 2024
@vitabaks vitabaks changed the title Add restore timeout and Standby Cluster initialization check pgBackRest: Point-In-Time Recovery (PITR) Improvements Nov 1, 2024
@vitabaks
Copy link
Owner Author

vitabaks commented Nov 1, 2024

Test

Result:


TASK [patroni : Stop patroni service on the Replica servers (if running)] ***************************************************************************************************************************************
changed: [172.31.33.111]
changed: [172.31.42.255]

TASK [patroni : Stop patroni service on the Master server (if running)] *****************************************************************************************************************************************
changed: [172.31.39.47]

TASK [patroni : Check that PostgreSQL is stopped] ***************************************************************************************************************************************************************
ok: [172.31.33.111]
ok: [172.31.39.47]
ok: [172.31.42.255]

TASK [patroni : Remove patroni cluster "vitabaks-pgcluster" from DCS (if exist)] ********************************************************************************************************************************
changed: [172.31.39.47]

TASK [patroni : Run "/usr/bin/pgbackrest --stanza=vitabaks-pgcluster --type=time "--target=2024-11-01 12:47:47+00" --delta restore" on Master] ******************************************************************
changed: [172.31.39.47]

TASK [patroni : Run "/usr/bin/pgbackrest --stanza=vitabaks-pgcluster --type=time "--target=2024-11-01 12:47:47+00" --delta restore" on Replica] *****************************************************************
changed: [172.31.33.111]
changed: [172.31.42.255]
FAILED - RETRYING: [172.31.42.255]: Waiting for restore from backup (2880 retries left).
FAILED - RETRYING: [172.31.39.47]: Waiting for restore from backup (2880 retries left).
FAILED - RETRYING: [172.31.33.111]: Waiting for restore from backup (2880 retries left).

TASK [patroni : Waiting for restore from backup] ****************************************************************************************************************************************************************
changed: [172.31.39.47] => (item=True)
changed: [172.31.42.255] => (item=True)
changed: [172.31.33.111] => (item=True)

TASK [patroni : Start PostgreSQL for Recovery] ******************************************************************************************************************************************************************
changed: [172.31.33.111]
changed: [172.31.39.47]
changed: [172.31.42.255]
FAILED - RETRYING: [172.31.39.47]: Wait for the PostgreSQL start command to complete (360 retries left).
FAILED - RETRYING: [172.31.33.111]: Wait for the PostgreSQL start command to complete (360 retries left).
FAILED - RETRYING: [172.31.42.255]: Wait for the PostgreSQL start command to complete (360 retries left).

TASK [patroni : Wait for the PostgreSQL start command to complete] **********************************************************************************************************************************************
changed: [172.31.39.47]
changed: [172.31.33.111]
changed: [172.31.42.255]
FAILED - RETRYING: [172.31.39.47]: Wait for PostgreSQL recovery to complete (WAL apply) (2880 retries left).

TASK [patroni : Wait for PostgreSQL recovery to complete (WAL apply)] *******************************************************************************************************************************************
ok: [172.31.39.47]

TASK [patroni : Get PostgreSQL recovery log] ********************************************************************************************************************************************************************
ok: [172.31.39.47]

TASK [patroni : Print PostgreSQL recovery log] ******************************************************************************************************************************************************************
ok: [172.31.39.47] => {
    "msg": [
        "2024-11-01 12:37:52.272 UTC [11910] LOG:  recovery stopping before commit of transaction 248047, time 2024-10-21 11:57:58.000145+00",
        "2024-11-01 12:37:52.272 UTC [11910] LOG:  redo done at 0/6591C308 system usage: CPU: user: 8.84 s, system: 5.19 s, elapsed: 45.81 s",
        "2024-11-01 12:37:52.272 UTC [11910] LOG:  last completed transaction was at log time 2024-10-21 11:57:57.999688+00"
    ]
}

TASK [patroni : Check that PostgreSQL is stopped] ***************************************************************************************************************************************************************
ok: [172.31.33.111]
ok: [172.31.39.47]
ok: [172.31.42.255]

TASK [patroni : Stop PostgreSQL] ********************************************************************************************************************************************************************************
changed: [172.31.39.47]

TASK [patroni : Check the patroni.dynamic.json exists] **********************************************************************************************************************************************************
ok: [172.31.39.47]
ok: [172.31.33.111]
ok: [172.31.42.255]

TASK [patroni : Start patroni service on the Master server] *****************************************************************************************************************************************************
changed: [172.31.39.47]

TASK [patroni : Wait for port 8008 to become open on the host] **************************************************************************************************************************************************
ok: [172.31.39.47]

TASK [patroni : Check PostgreSQL is started and accepting connections] ******************************************************************************************************************************************
ok: [172.31.39.47]

TASK [patroni : Wait for the cluster to initialize (master is the leader with the lock)] ************************************************************************************************************************
ok: [172.31.39.47]

TASK [patroni : Make sure the Master is not in recovery mode] ***************************************************************************************************************************************************
ok: [172.31.39.47]

TASK [patroni : Make sure the superuser and replication users are present, and password does not differ from the specified] *************************************************************************************
changed: [172.31.39.47] => (item=postgres)
changed: [172.31.39.47] => (item=replicator)

TASK [patroni : Make sure the postgresql users are present, and password does not differ from the specified] ****************************************************************************************************
ok: [172.31.39.47] => (item=pgbouncer)

TASK [patroni : Update postgresql authentication in patroni.yml] ************************************************************************************************************************************************
changed: [172.31.39.47]
changed: [172.31.33.111]
changed: [172.31.42.255]

TASK [patroni : Start patroni service on Replica servers] *******************************************************************************************************************************************************
changed: [172.31.33.111]
changed: [172.31.42.255]

TASK [patroni : Wait for port 8008 to become open on the host] **************************************************************************************************************************************************
ok: [172.31.42.255]
ok: [172.31.33.111]

TASK [patroni : Check that the patroni is healthy on the replica server] ****************************************************************************************************************************************
ok: [172.31.42.255]
ok: [172.31.33.111]

TASK [deploy-finish : Get Postgres users] ***********************************************************************************************************************************************************************
ok: [172.31.39.47]

TASK [deploy-finish : Get Postgres databases] *******************************************************************************************************************************************************************
ok: [172.31.39.47]

TASK [deploy-finish : Get Postgres cluster info] ****************************************************************************************************************************************************************
ok: [172.31.39.47]

TASK [deploy-finish : Postgres list of users] *******************************************************************************************************************************************************************
ok: [172.31.39.47] => {
    "msg": [
        "                              List of roles",
        " Role name  |                         Attributes                         ",
        "------------+------------------------------------------------------------",
        " pgbouncer  | ",
        " postgres   | Superuser, Create role, Create DB, Replication, Bypass RLS",
        " replicator | Replication"
    ]
}

TASK [deploy-finish : Postgres list of databases] ***************************************************************************************************************************************************************
ok: [172.31.39.47] => {
    "msg": [
        "                                                       List of databases",
        "   Name    |  Owner   | Encoding | Locale Provider |   Collate   |    Ctype    | ICU Locale | ICU Rules |   Access privileges   ",
        "-----------+----------+----------+-----------------+-------------+-------------+------------+-----------+-----------------------",
        " postgres  | postgres | UTF8     | libc            | en_US.UTF-8 | en_US.UTF-8 |            |           | ",
        " template0 | postgres | UTF8     | libc            | en_US.UTF-8 | en_US.UTF-8 |            |           | =c/postgres          +",
        "           |          |          |                 |             |             |            |           | postgres=CTc/postgres",
        " template1 | postgres | UTF8     | libc            | en_US.UTF-8 | en_US.UTF-8 |            |           | =c/postgres          +",
        "           |          |          |                 |             |             |            |           | postgres=CTc/postgres",
        "(3 rows)"
    ]
}

TASK [deploy-finish : Postgres Cluster info] ********************************************************************************************************************************************************************
ok: [172.31.39.47] => {
    "msg": [
        "+ Cluster: vitabaks-pgcluster (7428194193594549507) ---------------+----+-----------+",
        "| Member           | Host          | Role    | State               | TL | Lag in MB |",
        "+------------------+---------------+---------+---------------------+----+-----------+",
        "| ip-172-31-33-111 | 172.31.33.111 | Replica | in archive recovery |  3 |         0 |",
        "| ip-172-31-39-47  | 172.31.39.47  | Leader  | running             |  5 |           |",
        "| ip-172-31-42-255 | 172.31.42.255 | Replica | in archive recovery |  3 |         0 |",
        "+------------------+---------------+---------+---------------------+----+-----------+"
    ]
}

PLAY RECAP ******************************************************************************************************************************************************************************************************
172.31.33.111              : ok=38   changed=7    unreachable=0    failed=0    skipped=65   rescued=0    ignored=0   
172.31.39.47               : ok=58   changed=10   unreachable=0    failed=0    skipped=72   rescued=0    ignored=0   
172.31.42.255              : ok=38   changed=7    unreachable=0    failed=0    skipped=65   rescued=0    ignored=0   
localhost                  : ok=19   changed=0    unreachable=0    failed=0    skipped=206  rescued=0    ignored=0   

@vitabaks vitabaks merged commit af68b0a into master Nov 12, 2024
16 checks passed
@vitabaks vitabaks deleted the restore-timeout branch November 12, 2024 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvement of the current functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant