
Problems when upgrading cluster #666

Closed · rcknr opened this issue May 27, 2024 · 3 comments
Labels: bug (Something isn't working)

Comments

@rcknr (Contributor) commented May 27, 2024

Last week I did a major version upgrade with the playbook and encountered a few issues which I want to share.

  1. During the upgrade the maintenance_enable and maintenance_disable roles are used, but their functions are asymmetric: the enable role disables confd and deploys a temporary haproxy configuration that disables health checks, but it also stops the patroni cluster (it handles vip-manager tasks as well, which I don't use). The disable role, however, only deals with confd/haproxy/vip-manager and not with patroni. These tasks are executed on the database nodes, while confd/haproxy are deployed to the balancers host group from the inventory. As a result, when you run an upgrade the playbook tries to stop confd on cluster nodes that don't have it and fails.
  2. After I initially deployed my cluster I tried various settings to tune my setup and at one point set an invalid value for the log_timezone parameter. I fixed that long ago, but during the upgrade patroni picked up this old value from somewhere and tried to start the new postgres version with it, which caused a failure loop. I couldn't figure out where it was coming from for a while, until I found the patroni.dynamic.json file in my data directory, which was used to generate the settings for the new version. I think the best course of action would be to start from the latest DCS config rather than from that file, which was somehow persisted in the data directory (see the pre-flight check sketch at the end of this comment).

So item 1 definitely looks like a bug to me, while item 2 is mostly my own mistake and a lack of understanding of patroni configuration, but I think it should be highlighted so others are aware of it during an upgrade.
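
For anyone who runs into the same thing, here is a minimal pre-flight check that would have caught it in my case, written as Ansible tasks; the data directory and patroni config paths are assumptions and need to match your setup:

```yaml
# Hypothetical pre-upgrade check, not part of the playbook: compare the
# dynamic config persisted in PGDATA with the live config stored in DCS.
- name: Read patroni.dynamic.json from the data directory
  ansible.builtin.slurp:
    src: /var/lib/postgresql/16/main/patroni.dynamic.json  # adjust to your data dir
  register: persisted_dynamic_config

- name: Fetch the current dynamic configuration from DCS
  ansible.builtin.command: patronictl -c /etc/patroni/patroni.yml show-config
  register: dcs_dynamic_config
  changed_when: false

- name: Show both so a stale parameter (e.g. log_timezone) stands out
  ansible.builtin.debug:
    msg:
      - "{{ persisted_dynamic_config.content | b64decode }}"
      - "{{ dcs_dynamic_config.stdout }}"
```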

@rcknr (Contributor, Author) commented May 27, 2024

Does it make sense to make the following changes?

  • Move stopping the patroni cluster from maintenance_enable to the stop_services role, leaving maintenance_enable and maintenance_disable to take care of confd/haproxy/vip-manager only.
  • Extract the maintenance_enable and maintenance_disable tasks from the (5/6) UPGRADE: Upgrade PostgreSQL group so that they are executed before and after it, respectively, on the balancers hosts (see the sketch below).

If that's fine, I can produce a PR.
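
Roughly what I have in mind, as a sketch only (the group and role names follow the current layout as I understand it and may not match the real task files exactly):

```yaml
- name: Enable maintenance mode on the load balancers
  hosts: balancers
  roles:
    - role: maintenance_enable    # confd/haproxy/vip-manager only

- name: Upgrade PostgreSQL
  hosts: postgres_cluster         # assumed name of the database nodes group
  roles:
    - role: stop_services         # would now also stop the patroni cluster
    # ... existing upgrade roles/tasks ...

- name: Disable maintenance mode on the load balancers
  hosts: balancers
  roles:
    - role: maintenance_disable
```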

@vitabaks (Owner) commented May 27, 2024

If the tasks are performed on the wrong nodes, then this is a mistake; they must be performed on the appropriate host groups or use delegate_to.
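
For illustration only, something like this (the task, service, and group names are just an example):

```yaml
# Stop confd on the balancers group even though the play runs on the
# database nodes, by delegating the task.
- name: Stop confd during maintenance
  ansible.builtin.service:
    name: confd
    state: stopped
  delegate_to: "{{ item }}"
  loop: "{{ groups['balancers'] | default([]) }}"
  run_once: true
```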

I'll take a look at it later.

@vitabaks (Owner)

Fixed here: #699
