Patroni "CRITICAL: system ID mismatch, node belongs to a different cluster" #770
Hi @Hannsre, this error indicates that the replica is trying to join the cluster, but its system ID doesn't match the primary's, meaning the replica's data directory belongs to another cluster.
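A quick way to compare them is to read the "Database system identifier" from `pg_controldata` on each node. A minimal sketch in Ansible form (the inventory group name and the PostgreSQL binary/data paths below are assumptions for a Debian + PostgreSQL 16 install; adjust them to your setup):

```yaml
# Hedged sketch: group name and paths are assumptions, not the playbook's exact values.
- hosts: postgres_cluster
  become: true
  tasks:
    - name: Read the control data of the local instance
      ansible.builtin.command: >
        /usr/lib/postgresql/16/bin/pg_controldata /var/lib/postgresql/16/main
      register: controldata
      changed_when: false

    - name: Print the system identifier for this node
      ansible.builtin.debug:
        msg: "{{ controldata.stdout_lines | select('search', 'Database system identifier') | list }}"
```

If the identifiers differ between the primary and a replica, that replica's data directory was initialized separately and cannot simply rejoin the cluster.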
Thanks for the quick reply! Not sure how to check whether all the necessary data is there, tbh. After removing the data directory and starting Patroni I at least get a different error, so this seems connected to pgBackRest?
That's also one of the major changes I did, setting
I could also use MinIO S3 or SSH, as I have both available, but I'm not sure if that would make a difference. I couldn't locate the
Hopefully we are dealing with a test database cluster. The fact that you somehow have two different system IDs within the same cluster indicates that there may be discrepancies between the data on the replica servers and the primary. Therefore, it is crucial to analyze carefully which data is valuable to you. This becomes especially important if such a scenario were to occur in a production environment.
Which version of postgresql_cluster are you using? Could you attach a configuration file, @Hannsre? If you're using pgBackRest to prepare replicas, then you already have a backup stored, correct? Otherwise, you should only specify
Yes, this is primarily for testing, there is no data involved yet. I should be on release 2.0, as I just cloned the repo a few days ago. I've figured out where the differing IDs came from, because this was my mistake:
I did not; I misunderstood the config in the playbook and have fixed it now, so creating the cluster seems to be working. At least the Patroni logs look fine and primary and replica see each other. Here's the
What I'm still struggling with is pgBackRest in general. I've set it up as above, but I get an error that the MinIO domain can't be found, because the URL pgBackRest creates/uses is bucketname.domain.com instead of domain.com/bucketname, which doesn't make much sense to me.
Also there is
Here's the pgBackRest config part from
My goal in general is to be able to deploy and recreate the cluster from backups using Ansible, so in case anything goes wrong it can be restored to a working state quickly. But I'm not sure what's still not right in the playbook to get pgBackRest to actually deploy to a working state.
Quick update: the repository now got created in our MinIO bucket. Stuck now at
So it's mostly me fighting the pgBackRest config now, I guess.
Yes, according to the configuration example for MinIO in the documentation (https://postgresql-cluster.org/management/backup#command-line), the "repo1-s3-uri-style" option is required.
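For reference, a minimal sketch of the S3-related settings, assuming the `pgbackrest_conf` variable layout from the playbook's vars; the endpoint, bucket, and credentials are placeholders:

```yaml
pgbackrest_conf:
  global:
    - { option: "repo1-type", value: "s3" }
    - { option: "repo1-path", value: "/pgbackrest" }
    - { option: "repo1-s3-endpoint", value: "minio.example.com" }  # placeholder endpoint
    - { option: "repo1-s3-bucket", value: "pgbackrest-bucket" }    # placeholder bucket
    - { option: "repo1-s3-uri-style", value: "path" }              # domain.com/bucket instead of bucket.domain.com
    - { option: "repo1-s3-region", value: "us-east-1" }
    - { option: "repo1-s3-key", value: "<access-key>" }
    - { option: "repo1-s3-key-secret", value: "<secret-key>" }
```

With "path" style, pgBackRest addresses the bucket as domain.com/bucketname, which is what MinIO expects by default.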
Add the MinIO domain to the etc_hosts variable.
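For example (a hedged sketch; the IP and domain are placeholders, and the variable takes plain "IP hostname" strings):

```yaml
etc_hosts:
  - "192.168.1.10 minio.example.com"  # placeholder IP and MinIO domain
```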
It's an FQDN with proper DNS set up; the error came only from me not adding the
I think I got it working now, at least the bucket is now getting populated with archives on an archive run and there are no more errors in the logs. The error was sitting in front of the terminal after all. Thanks for your time and patience!
Hey, sorry to reopen this, but after some more testing (and a vacation) some more questions came up, specifically about backups and restore. I didn't want to open a new issue as this is still connected to my earlier issues. The restore command:
Sometimes the restore process simply fails at step
I know this can happen if there's already a postgres process running, but in those cases it usually said exactly that. Here I get no more details than this. There are no more errors in any of the logs; all of the restore processes show success. The last messages before the failure above are:
If I run the restore again it mostly works afterwards for that node, but then it hangs for a very long time after the replicas have been restored, while waiting for the master. There is just no output at all from Ansible, even at the highest verbosity level. I can see the command in htop on the master, but can't see anything happening. The last successful recovery took about 1.5 hours of waiting for the master node. Settings in
and
I've changed the cluster restore command as per the docs to
I can also post the full
Then ran the playbook with

And a general question, because after reading the docs multiple times and trying different approaches, I can't seem to find the correct way to restore a cluster from backup after a complete loss of the cluster, for example when you have to redeploy the cluster to a whole new set of hosts because of a catastrophic failure. PITR seems the wrong approach, as it expects an existing cluster. Deploying a new cluster and restoring like above always led to a system ID mismatch during/after the restore, or it failed due to mismatching stanzas. Cloning didn't work either for some reason (I forgot to write down the error here, sorry, but I'll try that way again). So what would be the best/correct approach here? I'm sure I'm missing something to properly restore from a complete loss. This setup is a lot to take in, so any input is much appreciated.
Hey, I've set both timeouts rather short to test; our test data isn't huge, so it shouldn't matter. The restore process itself is usually done within a minute. Will report back once done.
I'm not sure why, as my Ansible knowledge isn't that deep yet, but it seems the
So it's like before: it finishes restoring on the replicas, but gets stuck at the master node. No Ansible output whatsoever, and even with the timeout set to 180 seconds it waits much longer than that.
Hey guys, sorry to bug you again, but any new ideas on this?
This week I plan to try to reproduce the problem and suggest an improvement as part of the PR.
Thanks!
I then have to stop postgres & remove the
Not sure what to make of that, but it seems connected, since it should stop postgres like it does on the replicas and then restore.
So I did some more testing, and the whole issue seems to revolve around the leader node not being able to properly restart postgres after the restore.
run the playbook
and wait.
and afterwards
So basically I do manually what Ansible is supposed to do, from what I gathered.
I'm unsure what to make of this, but maybe it helps you find any potential issues. It also does look like the restore is successful; at least I get no error from the application when I point it to the cluster instead of its local DB, and the data all seems to be there.

Small addition/edit: I also deleted some entries from the DB, then ran the restore like above, and it all came back as expected.
Thanks for the details provided, it's really helpful.
Thanks, this seems to work fine now! Updated the patroni
For my understanding: you changed the handling of the leader node so it now runs a check to see if postgres is (still) up and stops it if true, on lines 704-727? Then added further checks for starting it back up.

One last question, now that the restore is working fine: what is the best way to recover in case of a complete loss of the cluster? I'll back up the nodes in general as well, so I can restore those, but I want to cover all my bases before going into production. You never know what kind of catastrophic failure may happen ;) I've had issues recreating the cluster nodes after deleting them on purpose, because it complained either about postgres not existing (when running with the PITR command) or that the stanza already exists when creating a new cluster. Setting the bootstrap method to
Great, I'm glad it worked out.
Yes, we have added additional checks to ensure the reliability of the recovery procedure.
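Not the actual tasks from the role, but the gist of the leader-node check, sketched as Ansible tasks (module choices and names here are illustrative only):

```yaml
# Illustrative sketch only, not the real code from the patroni role.
- name: Check whether PostgreSQL is still accepting connections on the leader
  ansible.builtin.command: pg_isready -p 5432
  register: pg_isready_result
  failed_when: false
  changed_when: false

- name: Stop Patroni (and with it PostgreSQL) before restoring, if it is still up
  ansible.builtin.service:
    name: patroni
    state: stopped
  when: pg_isready_result.rc == 0
```

Afterwards the service is started again, with additional checks that PostgreSQL is actually up before the play continues.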
This is described in the documentation https://postgresql-cluster.org/docs/management/restore
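In short, a rough sketch of the variables involved, with names as used in the documentation (the stanza is taken from your config and the timestamp is a placeholder):

```yaml
# Hedged sketch based on the documentation; adjust names and values to your setup.
patroni_cluster_bootstrap_method: "pgbackrest"  # bootstrap/restore the cluster from a pgBackRest backup

# Restore the latest backup:
pgbackrest_patroni_cluster_restore_command:
  '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --delta restore'

# Or point-in-time recovery (placeholder timestamp):
# pgbackrest_patroni_cluster_restore_command:
#   '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --type=time "--target=2024-06-01 10:00:00+00" --delta restore'
```

The same mechanism is used for in-place recovery and for cloning/restoring onto a freshly deployed set of hosts pointed at the existing backup repository.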
I recommend performing more recoveries to practice and build confidence in the restoration process. I personally manage hundreds of clusters and have been using this method for over 5 years for in-place recovery or cloning to a new cluster. Additionally, to ensure you always have reliable and timely assistance, you may consider purchasing an individual support package.
I am closing this issue; if you have any problems or questions, feel free to open a new one.
Hey,
first of all thanks for this project! Really appreciate the time and effort.
I'm kinda stuck though, and not really sure how to proceed. I've had a cluster running already, but since we're still testing and I discovered some issues, I tore it down to redeploy it. The first deployment was on an older version of the Ansible playbook though, so I'm not sure what changed. I compared my old version with the new one and couldn't really spot much of a difference.
The basis is Debian 12 VMs on Proxmox 8 deployed via Terraform, so they are absolutely identical.
The Patroni master node apparently works, but the replicas won't connect. According to the logs they have a different cluster ID.
I've checked and tried the solution in #747, but unfortunately it didn't do the trick for me. After removing the configs and cluster, it couldn't connect to PostgreSQL at all anymore.
This is the output from the playbook:
And the logs on both replica hosts:
This is the relevant part of my inventory (I think):
I've replaced the cluster completely twice now, changing the cluster name in between installs, but to no avail.
Not sure what else you need to troubleshoot, just let me know which logs or config you need.
Thanks in advance!