
Problems when upgrading cluster #666

Closed · rcknr opened this issue May 27, 2024 · 3 comments
Labels: bug (Something isn't working)

Comments

@rcknr (Contributor) commented May 27, 2024

Last week I did a major version upgrade with the playbook and encountered a few issues which I want to share.

  1. During the upgrade the maintenance_enable and maintenance_disable roles are used, but their functions are asymmetric: the enable role disables confd and deploys a temporary haproxy configuration that disables health checks, but it also stops the patroni cluster (it handles vip-manager tasks as well, which I don't use). The disable role, however, only deals with confd/haproxy/vip-manager and not with patroni. These tasks are executed on the database nodes, while confd/haproxy are deployed to the balancers host group from the inventory. As a result, when you run an upgrade the playbook tries to stop confd on cluster nodes that don't have it and fails.
  2. After I initially deployed my cluster I tried various settings to tune my setup and at one point set an invalid value for the log_timezone parameter. I fixed that long ago, but during the upgrade patroni picked up this old value from somewhere and tried to start the new postgres version with it, which caused a failure loop. I couldn't figure out where it was coming from for a while, until I found the patroni.dynamic.json file in my data directory, which was used to generate the settings for the new version. I think the best course of action would be to start from the latest DCS config rather than from that file, which was somehow persisted in the data directory (see the pre-flight check sketch at the end of this comment).

So item 1 definitely looks like a bug to me, while item 2 is mostly my own mistake and a lack of understanding of patroni configuration, but I think it should be highlighted so others are aware of it during an upgrade.
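
For anyone who runs into the same thing, here is a minimal pre-flight check that would have caught it in my case, written as Ansible tasks; the data directory and patroni config paths are assumptions and need to match your setup:

```yaml
# Hypothetical pre-upgrade check, not part of the playbook: compare the
# dynamic config persisted in PGDATA with the live config stored in DCS.
- name: Read patroni.dynamic.json from the data directory
  ansible.builtin.slurp:
    src: /var/lib/postgresql/16/main/patroni.dynamic.json  # adjust to your data dir
  register: persisted_dynamic_config

- name: Fetch the current dynamic configuration from DCS
  ansible.builtin.command: patronictl -c /etc/patroni/patroni.yml show-config
  register: dcs_dynamic_config
  changed_when: false

- name: Show both so a stale parameter (e.g. log_timezone) stands out
  ansible.builtin.debug:
    msg:
      - "{{ persisted_dynamic_config.content | b64decode }}"
      - "{{ dcs_dynamic_config.stdout }}"
```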

@rcknr (Contributor, Author) commented May 27, 2024

Does it make sense to make the following changes?

  • Move stopping the patroni cluster from maintenance_enable to the stop_services role, leaving maintenance_enable and maintenance_disable to take care of confd/haproxy/vip-manager only.
  • Extract the maintenance_enable and maintenance_disable tasks from the (5/6) UPGRADE: Upgrade PostgreSQL group so that they are executed before and after it, respectively, on the balancers hosts (see the sketch below).

If that's fine, I can produce a PR.
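
Roughly what I have in mind, as a sketch only (the group and role names follow the current layout as I understand it and may not match the real task files exactly):

```yaml
- name: Enable maintenance mode on the load balancers
  hosts: balancers
  roles:
    - role: maintenance_enable    # confd/haproxy/vip-manager only

- name: Upgrade PostgreSQL
  hosts: postgres_cluster         # assumed name of the database nodes group
  roles:
    - role: stop_services         # would now also stop the patroni cluster
    # ... existing upgrade roles/tasks ...

- name: Disable maintenance mode on the load balancers
  hosts: balancers
  roles:
    - role: maintenance_disable
```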

@vitabaks (Owner) commented May 27, 2024

If the tasks are performed on the wrong nodes, then this is a mistake; they must be performed on the appropriate host groups or use delegate_to.
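
For illustration only, something like this (the task, service, and group names are just an example):

```yaml
# Stop confd on the balancers group even though the play runs on the
# database nodes, by delegating the task.
- name: Stop confd during maintenance
  ansible.builtin.service:
    name: confd
    state: stopped
  delegate_to: "{{ item }}"
  loop: "{{ groups['balancers'] | default([]) }}"
  run_once: true
```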

I'll take a look at it later.

@vitabaks (Owner)

Fixed here: #699
