Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patroni "CRITICAL: system ID mismatch, node belongs to a different cluster" #770

Closed
Hannsre opened this issue Sep 24, 2024 · 21 comments
Closed
Assignees

Comments

@Hannsre
Copy link

Hannsre commented Sep 24, 2024

Hey,

first of all thanks for this project! Really appreciate the time and effort.

I'm kinda stuck though, not really sure how to proceed. I've had a cluster running already, but since we're still testing and I discovered some issues, I tore it down to redeploy it. The first deployment was on an older version of the ansible playbook though, so I'm not sure what changed. I compared my old version with the new and couldn't really spot much of a difference.

Basis are Debian 12 VMs on Proxmox 8 deployed via Terraform, so they are absolutely identical.

The patroni master Node works apparently, but the Replicas won't connect. According to the Logs they are in a different Cluster ID.

I've checked and tried the solution in #747 , but unfortunately it didn't do the trick for me. After removing the configs and cluster, it couldn't connect to postgresql anymore at all.

This is the output from the playbook:

TASK [patroni : Wait for port 8008 to become open on the host] *********************************************************************************************************************************************************************************************************************
fatal: [192.168.88.103]: FAILED! => {"changed": false, "elapsed": 120, "msg": "Timeout when waiting for 192.168.88.103:8008"}
fatal: [192.168.88.102]: FAILED! => {"changed": false, "elapsed": 120, "msg": "Timeout when waiting for 192.168.88.102:8008"}

And the logs on both Replica Hosts:

Sep 24 08:58:30 pgcn02 patroni[37244]: 2024-09-24 08:58:30,734 INFO: Selected new etcd server http://192.168.88.102:2379
Sep 24 08:58:30 pgcn02 patroni[37244]: 2024-09-24 08:58:30,739 INFO: No PostgreSQL configuration items changed, nothing to reload.
Sep 24 08:58:30 pgcn02 patroni[37244]: 2024-09-24 08:58:30,741 CRITICAL: system ID mismatch, node pgcn02 belongs to a different cluster: 7418131322725245191 != 7418131302586179729
Sep 24 08:58:30 pgcn02 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
Sep 24 08:58:30 pgcn03 patroni[37243]: 2024-09-24 08:58:30,843 INFO: Selected new etcd server http://192.168.88.101:2379
Sep 24 08:58:30 pgcn03 patroni[37243]: 2024-09-24 08:58:30,847 INFO: No PostgreSQL configuration items changed, nothing to reload.
Sep 24 08:58:30 pgcn03 patroni[37243]: 2024-09-24 08:58:30,849 CRITICAL: system ID mismatch, node pgcn03 belongs to a different cluster: 7418131322725245191 != 7418131302059765907
Sep 24 08:58:30 pgcn03 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE

This is the relvant part in my inventory (I think):

# PostgreSQL nodes
[master]
192.168.88.101

[replica]
192.168.88.102
192.168.88.103

[postgres_cluster:children]
master
replica

I've replaced the cluster completely twice now, changing the cluster name in between installs, but to no avail.
Not sure what else you need to troubleshoot, just let me know which logs or config you need.

Thanks in advance!

@vitabaks
Copy link
Owner

Hi @Hannsre,

This error indicates that the replica is trying to join a cluster, but its system ID doesn't match the primary, meaning the replica belongs to another cluster.

  1. Since the Primary is running, ensure that it contains all necessary data before proceeding.
  2. On the replica: stop the Patroni service, clear the data directory, and start the Patroni service. This will trigger resynchronization from the Primary with the correct cluster ID.

@Hannsre
Copy link
Author

Hannsre commented Sep 24, 2024

Thanks for the quickt reply!

Not sure how to check whether all the necessary data is there tbh.

After removing the data directory and starting patroni I get at least a different error, so this seems connected to pgbackrest?
That's at least the only occurence of repo1-path I could find in the playbook.

Sep 24 10:40:18 pgcn02 patroni[49819]: 2024-09-24 10:40:18,812 INFO: Selected new etcd server http://192.168.88.103:2379
Sep 24 10:40:18 pgcn02 patroni[49819]: 2024-09-24 10:40:18,817 INFO: No PostgreSQL configuration items changed, nothing to reload.
Sep 24 10:40:18 pgcn02 patroni[49819]: 2024-09-24 10:40:18,863 INFO: Lock owner: pgcn01; I am pgcn02
Sep 24 10:40:18 pgcn02 patroni[49819]: 2024-09-24 10:40:18,911 INFO: trying to bootstrap from leader 'pgcn01'
Sep 24 10:40:18 pgcn02 patroni[49835]: P00  ERROR: [031]: option 'repo1-path' cannot be set multiple times
Sep 24 10:40:18 pgcn02 patroni[49819]: 2024-09-24 10:40:18,921 ERROR: Error creating replica using method pgbackrest: /usr/bin/pgbackrest --stanza=postgresql-cluster --delta restore exited with code=31
Sep 24 10:40:18 pgcn02 patroni[49819]: 2024-09-24 10:40:18,921 ERROR: failed to bootstrap from leader 'pgcn01'
Sep 24 10:40:18 pgcn02 patroni[49819]: 2024-09-24 10:40:18,921 INFO: Removing data directory: /var/lib/postgresql/16/main

That's also one of the major changes I did, setting pgbackrest_install: true instead of false, als I added a S3 Bucket to the config according to the docs by adding this block (with our data ofc):

    - { option: "repo1-s3-key", value: "YOUR_AWS_S3_ACCESS_KEY" }  # change this value
    - { option: "repo1-s3-key-secret", value: "YOUR_AWS_S3_SECRET_KEY" }  # change this value
    - { option: "repo1-s3-bucket", value: "YOUR_BUCKET_NAME" }  # change this value
    - { option: "repo1-s3-endpoint", value: "s3.us-east-1.amazonaws.com" }  # change this value
    - { option: "repo1-s3-region", value: "us-east-1" }  # change this value

Could also use MinioS3 or SSH, as I have both available as well, but I'm not sure if that would make a difference. I couldn't locate the pgbackrest_auto_conf variable though, so not sure what it's set to currently.

@vitabaks
Copy link
Owner

Not sure how to check whether all the necessary data is there tbh.

Hopefully, we are dealing with a test database cluster. The fact that you somehow have two different 'system IDs' within the same cluster indicates that there may be discrepancies between the data on replica servers and the Primary. Therefore, it is crucial to conduct a thorough analysis of which data is valuable to you. This becomes especially important if such a scenario were to occur in a production environment.

After removing the data directory and starting patroni I get at least a different error, so this seems connected to pgbackrest?

Which version of postgresql_cluster are you using? Could you attach a configuration file /etc/pgbackrest/pgbackrest.conf?

@Hannsre If you're using pgBackRest to prepare replicas, then you already have a backup stored, correct? Otherwise, you should only specify basebackup in the patroni_create_replica_methods variable.

@Hannsre
Copy link
Author

Hannsre commented Sep 26, 2024

Hopefully, we are dealing with a test database cluster.

Yes, this is primarily for Testing, there is no Data involved yet.
This Setup is mostly for learning so I'm fine with setting it up and tearing it down again until I got it right and also got it to restore from an actual backup for disaster recovery and so on.

I should be on Release 2.0, as I just cloned the repo a few days ago.

I got the differing IDs figured, because this was my mistake:

If you're using pgBackRest to prepare replicas, then you already have a backup stored, correct?

I did not, misunderstood the config in the playbook and fixed it now so creating the Cluster now seems to be working. At least Patroni logs look fine and Primary and Replica see each other.

Here's the pgbackrest.conf content:

[global]
log-level-file=detail
log-path=/var/log/pgbackrest
repo1-type=s3
repo1-path=/pgbackrest
repo1-s3-key=redacted
repo1-s3-key-secret=redacted
repo1-s3-bucket=name of our bucket like shown in minio
repo1-s3-endpoint=our minio domain
repo1-s3-region=our minio region
repo1-retention-full=4
repo1-retention-archive=4
repo1-bundle=y
repo1-block=y
start-fast=y
stop-auto=y
link-all=y
resume=n
spool-path=/var/spool/pgbackrest
archive-async=y
archive-get-queue-max=1GiB

[stanza-name]
process-max=4
log-level-console=info
recovery-option=recovery_target_action=promote
pg1-socket-path=/var/run/postgresql
pg1-path=/var/lib/postgresql/16/main

What I'm still struggling with is pgbackrest in general. I've set it up as above, but get an error that the minio Domain can't be found, because the URL pgbackrest creates/uses is bucketname.domain.com instead of domain.com/bucketname which doesn't make much sense to me.

[HostConnectError] on 2 retries from 120100-180201ms: unable to get address for 'bucketname.minio-domain.com': [-2] Name or service not known",

Also there is unable to acquire lock on file '/tmp/pgbackrest/stnaza-name-archive-1.lock': Resource temporarily unavailable" - I can see the Lockfile being created on the hosts though.

Here's the pgbackrest config part from main.yml:

# pgBackRest
pgbackrest_install: true  # or 'true' to install and configure backups using pgBackRest
pgbackrest_install_from_pgdg_repo: true  # or 'false'
pgbackrest_stanza: "{{ patroni_cluster_name }}"  # specify your --stanza
pgbackrest_repo_type: "s3"  # or "s3", "gcs", "azure"
pgbackrest_repo_host: ""  # dedicated repository host (optional)
pgbackrest_repo_user: "postgres"  # if "repo_host" is set (optional)
pgbackrest_conf_file: "/etc/pgbackrest/pgbackrest.conf"
# config https://pgbackrest.org/configuration.html
pgbackrest_conf:
  global:  # [global] section
    - { option: "log-level-file", value: "detail" }
    - { option: "log-path", value: "/var/log/pgbackrest" }
    - { option: "repo1-type", value: "{{ pgbackrest_repo_type | lower }}" }
    - { option: "repo1-path", value: "/pgbackrest" }  # logical path in bucket
    - { option: "repo1-s3-key", value: "{{ s3_access_key }}" }  # change this value
    - { option: "repo1-s3-key-secret", value: "{{ s3_secret_key }}" }  # change this value
    - { option: "repo1-s3-bucket", value: "bucketname" }  # change this value
    - { option: "repo1-s3-endpoint", value: "minio URL" }  # change this value
    - { option: "repo1-s3-region", value: "Minio Region" }  # change this value
    - { option: "repo1-retention-full", value: "4" }
    - { option: "repo1-retention-archive", value: "4" }
    - { option: "repo1-bundle", value: "y" }
    - { option: "repo1-block", value: "y" }
    - { option: "start-fast", value: "y" }
    - { option: "stop-auto", value: "y" }
    - { option: "link-all", value: "y" }
    - { option: "resume", value: "n" }
    - { option: "spool-path", value: "/var/spool/pgbackrest" }
    - { option: "archive-async", value: "y" } # Enables asynchronous WAL archiving (details: https://pgbackrest.org/user-guide.html#async-archiving)
    - { option: "archive-get-queue-max", value: "1GiB" }
#    - { option: "archive-push-queue-max", value: "100GiB" }
#    - { option: "backup-standby", value: "y" } # When set to 'y', standby servers will be automatically added to the stanza section.
#    - { option: "", value: "" }
  stanza:  # [stanza_name] section
    - { option: "process-max", value: "4" }
    - { option: "log-level-console", value: "info" }
    - { option: "recovery-option", value: "recovery_target_action=promote" }
    - { option: "pg1-socket-path", value: "{{ postgresql_unix_socket_dir }}" }
    - { option: "pg1-path", value: "{{ postgresql_data_dir }}" }
#    - { option: "", value: "" }

s3_access_key and s3_scret_key are set in an ansible vault and passed from there.

My goal in general is to be able to deploy and recreate the cluster from Backups using ansible, so in case anything goes wrong it can be restored to a working state quickly. But I'm not sure what's still not right in the playbook to get pgbackrest to actually deploy to a working state.

@Hannsre
Copy link
Author

Hannsre commented Sep 26, 2024

Quick Update:
I had to add - { option: "repo1-s3-uri-style", value: "path" } to the pgbackrest_conf which apparently I forgot.
Also had to add - { option: "pg1-user", value: "postgres" }to the stanza to get a connection.

So the repository now got created in our minio bucket.

Stuck now at [073]: unable to list file info for missing path '/var/lib/postgresql/16/main/archive_status' while trying to create the first backup.

So it's mostly me fighting pgbackrest config now I guess.

@vitabaks
Copy link
Owner

Yes, according to the configuration example for Minio in the documentation - https://postgresql-cluster.org/management/backup#command-line the "repo1-s3-uri-style" option is required.

@vitabaks
Copy link
Owner

vitabaks commented Sep 26, 2024

Name or service not known

Add the minio domain in the etc_hosts variable.

@Hannsre
Copy link
Author

Hannsre commented Sep 26, 2024

Add the minio domain in the etc_hosts variable.

It's a FQDN with proper DNS set up, the error came only due to me not adding the repo1-s3-uri-style-Option.

Think I got it working now, at least the bucket is now getting populated with archives on an archive run as well as no more errors in the logs.

The error was sitting in front of the terminal after all.

Thanks for your time and patience!

@Hannsre Hannsre closed this as completed Sep 26, 2024
@Hannsre
Copy link
Author

Hannsre commented Oct 10, 2024

Hey,

sorry to reopen this, but after some more testing (and a vacation) some more questions came up specifically for backups and restore. Didn't want to open a new issue as this is still connected to my former issues.
Deploying a new cluster works fine now, as well as creating the backups and backup tasks. The Backups do work in theory, but are a bit flaky.

The restore command:


Sometimes the restore process simply fails at step Waiting for PostgreSQL Recovery to complete (WAL apply); with a pretty generic error:

atal: [192.168.88.103]: FAILED! => {"changed": true, "cmd": ["/usr/lib/postgresql/16/bin/pg_ctl", "start", "-D", "/var/lib/postgresql/16/main", "-o", "-c hot_standby=off", "-w", "-t", "1800"], "delta": "0:00:00.205953", "end": "2024-10-10 08:52:21.402731", "msg": "non-zero return code", "rc": 1, "start": "2024-10-10 08:52:21.196778", "stderr": "pg_ctl: could not start server\nExamine the log output.", 

I know this can happen if there's already a postgres process running, but in those cases it usually said exactly that. Here I get no more details than this.

There are no more errors in any of the logs, all of the restore processes show success. Last messages before the failure above are:

"2024-10-10 08:52:21.395 P00   INFO: found 000000020000000000000021 in the archive asynchronously", "2024-10-10 08:52:21.395 P00   INFO: archive-get command end: completed successfully (1ms)", "2024-10-10 08:52:21.396 UTC [449893] LOG:  restored log file \"000000020000000000000021\" from archive", "2024-10-10 08:52:21.399 UTC [449893] LOG:  completed backup recovery with redo LSN 0/1E000028 and end LSN 0/1E000138", "2024-10-10 08:52:21.399 UTC 
[449893] LOG:  consistent recovery state reached at 0/1E000138", "2024-10-10 08:52:21.399 UTC [449893] LOG:  recovery stopping after reaching consistency", "2024-10-10 08:52:21.400 UTC [449890] LOG:  shutdown at recovery target" 
"2024-10-10 08:52:21.400 UTC [449891] LOG:  shutting down", "2024-10-10 08:52:21.401 UTC [449890] LOG:  database system is shut down", " stopped waiting"]}

If I run the restore again it mostly works afterwards for that node, but then hangs for a very long time after the replicas have been restored while waiting for the master. There is just no output at all by ansible, even at highest verbosity level. I can see the command in htop on the master, but can't see anything happening. Last successful recovery took about 1,5 hours of waiting for the master node.

Settings in main.yml:

# https://patroni.readthedocs.io/en/latest/replica_bootstrap.html#bootstrap
patroni_cluster_bootstrap_method: "pgbackrest"  # or "wal-g", "pgbackrest", "pg_probackup"

# https://patroni.readthedocs.io/en/latest/replica_bootstrap.html#building-replicas
patroni_create_replica_methods:
  - pgbackrest
#  - wal_g
#  - pg_probackup
  - basebackup

pgbackrest:
  - { option: "command", value: "{{ pgbackrest_patroni_cluster_restore_command }}" }
  - { option: "keep_data", value: "True" }
  - { option: "no_params", value: "True" }
basebackup:
  - { option: "max-rate", value: "1000M" }
  - { option: "checkpoint", value: "fast" }
#  - { option: "waldir", value: "{{ postgresql_wal_dir }}" }

and

postgresql_restore_command: "/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} archive-get %f %p"  # restore WAL-s using pgbackrest

I've changed the cluster restore command as per the docs to

pgbackrest_patroni_cluster_restore_command:
  '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --type=immediate --delta restore'

I can also post the full main.yml if you prefer.

Then ran the playbook with -t point_in_time_recovery -e "disable_archive_command=false".
Since it generelly works, but just takes very long, I wonder if it could be any setting on our systems that blocks the process? All nodes are created using ansible and terraform, so they should be absolutely identical.

And a general question because after reading the docs multiple times and trying different approaches I can't seem to find the correct way to restore a Cluster from backup after a complete loss of the cluster. Like when you have to redeploy the cluster to a whole new set of hosts because of a catastrophic failure. PITR seems the wrong approach as it expects an existing cluster. Deploying a new cluster and restoring like above always led to a system ID mismatch while/after restore or it fails due to mismatching stanzas.

Cloning didn't work as well for some reason (I forgot to write down the error here, sorry, but I'll try this way again).

So what would be the best/correct approach here?

I'm sure I'm missing something to properly restore it from a complete loss. This setup is a lot to take in, so every input is much apprechiated.

@Hannsre Hannsre reopened this Oct 10, 2024
@vitabaks
Copy link
Owner

Hi @Hannsre

We are currently working on PITR-related improvements (#765). Could you test the restore-timeout branch?

@SDV109 I also ask you to take a look at this problem.

@Hannsre
Copy link
Author

Hannsre commented Oct 11, 2024

Hey,
sure, giving it a go right now.
Just noticed that there is pg_ctl_timeout set in patroni/tasks/main.yml with a default of 3600, but it's not defined in/vars/main.yml like the new cluster_restore_timeout is.

I've set both timeouts rather short to test, our test-data isn't huge so it shouldn't matter. The restore process itself is usually done within a minute.

Will report back once done.

@Hannsre
Copy link
Author

Hannsre commented Oct 11, 2024

I'm not sure why, as my ansible knowledge isn't that deep yet, but it seems the pg_ctl_timeout is ignored because ansible adds a newline at the end of the command:

changed: [192.168.88.102] => {
    "changed": true,
    "cmd": "/usr/lib/postgresql/16/bin/pg_ctl start -D /var/lib/postgresql/16/main -o '-c hot_standby=off' -w -t 180\n",
    "delta": "0:00:00.240277",
    "end": "2024-10-11 07:42:01.847932",
    "invocation": {
        "module_args": {
            "_raw_params": "/usr/lib/postgresql/16/bin/pg_ctl start -D /var/lib/postgresql/16/main -o '-c hot_standby=off' -w -t 180\n",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": "/bin/bash",
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true
        }
    },

So it's like before, it finished restoring on the replicas, but get's stuck at the master node. No ansible output whatsoever and even with timeout set to 180 Seconds it waits much longer than that.

@Hannsre
Copy link
Author

Hannsre commented Oct 28, 2024

Hey guys,

sorry, to bug you again, but any new Ideas on this?
I'm really stuck here because while it does work to deploy the cluster, running it without working backup & restore sounds like a bad time.
I know this is free support in your spare time, so don't take this as critique, I'm simply all out of ideas...

@vitabaks
Copy link
Owner

this week I plan to try to reproduce the problem and suggest an improvement as part of the PR.

@vitabaks vitabaks self-assigned this Oct 28, 2024
@Hannsre
Copy link
Author

Hannsre commented Oct 28, 2024

Thanks!
I also just noticed I forgot to add another error message that might or might not be helpful.
Wenn running the restore, it will succeed on the replicas, but on the master node it fails with:

"2024-10-28 12:07:46.168 P00  ERROR: [038]: unable to restore while PostgreSQL is running", "                                    HINT: presence of 'postmaster.pid' in '/var/lib/postgresql/16/main' indicates PostgreSQL is running.", "                                    HINT: remove 'postmaster.pid' only if PostgreSQL is not running.", "2024-10-28 12:07:46.168 P00   INFO: restore command end: aborted with exception [038]"]}

I then have to stop postgres & Remove the postmaster.pid manually before I can run the playbook again to attempt a restore. Then it'll run untill it times out (or sometimes not) on the master Node.

Not sure what to make of that, but it seems connected since it should stop postgres like it does on the replicas to then restore.

@Hannsre
Copy link
Author

Hannsre commented Oct 29, 2024

So I did some more testing and the whole issue seems to be revolving around the Leader Node not being able to properly restart postgres after the restore.
I stopped/removed postgres like in my comment before by running

sudo systemctl stop postgresql
sudo rm /var/lib/postgresql/16/main/postmaster.pid

run the playbook

ansible-playbook deploy_pgcluster.yml -t point_in_time_recovery -e "disable_archive_command=false" -e '@pgcn.yml' --ask-vault-pass -v

and wait.
When Ansible ist stuck waiting for Postgres to finish and restart on the leader I again run

sudo systemctl stop postgresql
sudo rm /var/lib/postgresql/16/main/postmaster.pid

and afterwards

sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl start -D /var/lib/postgresql/16/main -o '-c hot_standby=off' -w -t 180

So basically I do manually what ansible is supposed to do from what I gathered.
The playbook will then continue and eventually finish successfully

TASK [deploy-finish : Postgres Cluster info] *************************************************************************************************************************************************************************************************************************************************
ok: [192.168.88.101] => {
    "msg": [
        "+ Cluster: psql-dev (7423735485093921234) +----+-----------+",
        "| Member | Host           | Role    | State   | TL | Lag in MB |",
        "+--------+----------------+---------+---------+----+-----------+",
        "| pgcn01 | 192.168.88.101 | Leader  | running |  4 |           |",
        "| pgcn02 | 192.168.88.102 | Replica | running |  2 |         0 |",
        "| pgcn03 | 192.168.88.103 | Replica | running |  2 |         0 |",
        "+--------+----------------+---------+---------+----+-----------+"
    ]
}

PLAY RECAP ***********************************************************************************************************************************************************************************************************************************************************************************
192.168.88.101             : ok=59   changed=6    unreachable=0    failed=0    skipped=68   rescued=0    ignored=0   
192.168.88.102             : ok=38   changed=4    unreachable=0    failed=0    skipped=64   rescued=0    ignored=0   
192.168.88.103             : ok=38   changed=4    unreachable=0    failed=0    skipped=64   rescued=0    ignored=0   
localhost                  : ok=2    changed=0    unreachable=0    failed=0    skipped=222  rescued=0    ignored=0   

I'm unsure what to make of this, but maybe it helps you find any potential Issues.

It also does look like the restore is successful, at least I get no error by the application when I point it to use the cluster instead of it's local DB and the data seems to be all there.

Small Addition/Edit:
While running the restore for the first time I don't have to stop/remove postgres and the pid, but simply kill the running process postgres /usr/lib/postgresql/16/bin/pg_ctl start -D /var/lib/postgresql/16/main -o '-c hot_standby=off and run it again to get the Playbook to finish.

I also deleted some entries from the DB, then ran the restore like above and it all came back as supposed.

@vitabaks
Copy link
Owner

Thanks for the details provided, it's really helpful.

@vitabaks
Copy link
Owner

vitabaks commented Nov 7, 2024

@Hannsre I have made some important improvements here #765, please check.

@Hannsre
Copy link
Author

Hannsre commented Nov 11, 2024

Thanks, this seems to work fine now! Updated the patroni main.yml and ran the restore like before and it went through just fine and pretty quick as well.

For my understanding: You changed the handling of the leader node so it now runs a check to see if postgres is (still) up, then stop it if true on Lines 704-727? Then added further checks for starting it back up.

One last question, now that restore is working fine: What is the best way to recover in case of a complete loss of the cluster? I'll backup the nodes in general as well, so I can restore those, but I wanna cover all my bases before going into production. You never know what kind of catastrophic failure may happen ;)

I've had issues recreating the cluster nodes after deleting them on purpose, because it complained either about postgres not existing (when running with PITR command) or that the stanza already exist if creating a new cluster. Setting the bootstrap method to pgbackrest also did not work (I forgot the error tbh). But this might've been connected to this former issue, because this seems to be the correct way from my understanding. Or did I get this wrong and there's another, correct way?

@vitabaks
Copy link
Owner

Thanks, this seems to work fine now! Updated the patroni main.yml and ran the restore like before and it went through just fine and pretty quick as well.

Great, I'm glad it worked out.

For my understanding: You changed the handling of the leader node so it now runs a check to see if postgres is (still) up, then stop it if true on Lines 704-727? Then added further checks for starting it back up.

Yes, we have added additional checks to ensure the reliability of the recovery procedure.

What is the best way to recover in case of a complete loss of the cluster?

This is described in the documentation https://postgresql-cluster.org/docs/management/restore

I've had issues recreating the cluster nodes after deleting them on purpose, because it complained either about postgres not existing (when running with PITR command) or that the stanza already exist if creating a new cluster. Setting the bootstrap method to pgbackrest also did not work (I forgot the error tbh). But this might've been connected to this former issue, because this seems to be the correct way from my understanding. Or did I get this wrong and there's another, correct way?

I recommend performing more recoveries to practice and build confidence in the restoration process. I personally manage hundreds of clusters and have been using this method for over 5 years for in-place recovery or cloning to a new cluster.

Additionally, to ensure you always have reliable and timely assistance, you may consider purchasing an individual support package.

@vitabaks
Copy link
Owner

I am closing this issue, if you have any problems or questions, feel free to open a new one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants
@vitabaks @Hannsre and others