Steps to reproduce

1. juju deploy postgresql --base ubuntu@22.04 --channel 14/stable -n 3
2. Take down one unit by simulating a hardware failure (see the sketch below).
3. Remove the failed unit from the model.
4. Add another unit to have 3 nodes again.
5. Take down another unit by simulating a hardware failure.
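For reference, one minimal way to simulate the hardware failure, assuming the unit's machine can be reached over juju ssh (an illustrative sketch, not necessarily the exact mechanism used during this reproduction):

$ # hard power-off of the machine hosting the unit, bypassing any clean shutdown
$ juju ssh postgresql/2 'sudo systemctl poweroff --force --force'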
Expected behavior
The cluster keeps working, since two of the three nodes are still alive and quorum should still be satisfied.
Actual behavior
The cluster is not operational.
$ sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
2024-08-06 06:12:20,353 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-06 06:12:25,425 - INFO - waiting on raft
2024-08-06 06:12:30,425 - INFO - waiting on raft
2024-08-06 06:12:35,426 - INFO - waiting on raft
2024-08-06 06:12:40,426 - INFO - waiting on raft
^C
Aborted!
The charm states postgresql/0 is the primary, but that is no longer true: there is no postgresql process running on postgresql/0 anymore.
$ juju status
Model     Controller       Cloud/Region  Version  SLA          Timestamp
postgres  maas-controller  maas/default  3.5.3    unsupported  06:17:39Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active    2/3  postgresql  14/stable  429  no

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.112  5432/tcp  Primary
postgresql/1   active    idle   1        192.168.151.113  5432/tcp
postgresql/3   unknown   lost   3        192.168.151.115  5432/tcp  agent lost, see 'juju show-status-log postgresql/3'

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.112  machine-2  ubuntu@22.04  default  Deployed
1        started  192.168.151.113  machine-3  ubuntu@22.04  default  Deployed
3        down     192.168.151.115  machine-4  ubuntu@22.04  default  Deployed
$ juju run postgresql/leader get-primary
Running operation 7 with 1 task
- task 8 on unit-postgresql-0
Waiting for task 8...
primary: postgresql/0
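One way to double-check that no postgresql process is left on the supposed primary (an illustrative command; its output was not captured as part of this report) is to list postgres processes on the unit, which comes back empty:

$ juju ssh postgresql/0 'ps -C postgres -o pid,cmd'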
Initial status:

$ juju status
Model     Controller       Cloud/Region  Version  SLA          Timestamp
postgres  maas-controller  maas/default  3.5.3    unsupported  05:40:11Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active      3  postgresql  14/stable  429  no

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.112  5432/tcp  Primary
postgresql/1   active    idle   1        192.168.151.113  5432/tcp
postgresql/2   active    idle   2        192.168.151.114  5432/tcp

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.112  machine-2  ubuntu@22.04  default  Deployed
1        started  192.168.151.113  machine-3  ubuntu@22.04  default  Deployed
2        started  192.168.151.114  machine-4  ubuntu@22.04  default  Deployed
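At this point the healthy three-node topology can also be confirmed from Patroni itself, using the same patronictl invocation as later in this report (output for this step was not captured):

$ sudo -u snap_daemon patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list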
After taking down postgresql/2 (non-leader) -> expected status:

$ juju status
Model     Controller       Cloud/Region  Version  SLA          Timestamp
postgres  maas-controller  maas/default  3.5.3    unsupported  05:47:59Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active    2/3  postgresql  14/stable  429  no

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.112  5432/tcp  Primary
postgresql/1   active    idle   1        192.168.151.113  5432/tcp
postgresql/2   unknown   lost   2        192.168.151.114  5432/tcp  agent lost, see 'juju show-status-log postgresql/2'

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.112  machine-2  ubuntu@22.04  default  Deployed
1        started  192.168.151.113  machine-3  ubuntu@22.04  default  Deployed
2        down     192.168.151.114  machine-4  ubuntu@22.04  default  Deployed
Cleaning up postgresql/2 from the model: remove-machine --force was used instead of remove-unit, since the machine/unit agent was no longer responding after the hardware failure.
$ juju remove-machine --force 2
WARNING This command will perform the following actions:
will remove machine 2
- will remove unit postgresql/2
- will remove storage pgdata/2
Continue [y/N]? y
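For completeness, an equivalent forced cleanup at the unit level would look roughly like the command below (not what was used in this reproduction); the dead machine can then still be removed with remove-machine if it lingers in the model:

$ juju remove-unit postgresql/2 --force --no-wait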
$ juju status
Model     Controller       Cloud/Region  Version  SLA          Timestamp
postgres  maas-controller  maas/default  3.5.3    unsupported  05:49:43Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active      2  postgresql  14/stable  429  no

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.112  5432/tcp  Primary
postgresql/1   active    idle   1        192.168.151.113  5432/tcp

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.112  machine-2  ubuntu@22.04  default  Deployed
1        started  192.168.151.113  machine-3  ubuntu@22.04  default  Deployed
Adding another machine as the 3rd node (postgresql/3) in the cluster -> expected status:
$ juju add-unit postgresql
$ juju status
Model     Controller       Cloud/Region  Version  SLA          Timestamp
postgres  maas-controller  maas/default  3.5.3    unsupported  05:58:54Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active      3  postgresql  14/stable  429  no

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.112  5432/tcp  Primary
postgresql/1   active    idle   1        192.168.151.113  5432/tcp
postgresql/3   active    idle   3        192.168.151.115  5432/tcp

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.112  machine-2  ubuntu@22.04  default  Deployed
1        started  192.168.151.113  machine-3  ubuntu@22.04  default  Deployed
3        started  192.168.151.115  machine-4  ubuntu@22.04  default  Deployed
Taking down postgresql/3:

$ juju status
Model     Controller       Cloud/Region  Version  SLA          Timestamp
postgres  maas-controller  maas/default  3.5.3    unsupported  06:18:48Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active    2/3  postgresql  14/stable  429  no

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.112  5432/tcp  Primary
postgresql/1   active    idle   1        192.168.151.113  5432/tcp
postgresql/3   unknown   lost   3        192.168.151.115  5432/tcp  agent lost, see 'juju show-status-log postgresql/3'

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.112  machine-2  ubuntu@22.04  default  Deployed
1        started  192.168.151.113  machine-3  ubuntu@22.04  default  Deployed
3        down     192.168.151.115  machine-4  ubuntu@22.04  default  Deployed
The cluster should still work at this point since there are two living nodes out of the 3-node cluster. However, no Patroni operation is possible any longer.
$ sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
2024-08-06 06:24:29,290 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-06 06:24:34,357 - INFO - waiting on raft
2024-08-06 06:24:39,358 - INFO - waiting on raft
2024-08-06 06:24:44,358 - INFO - waiting on raft
2024-08-06 06:24:49,359 - INFO - waiting on raft
2024-08-06 06:24:54,359 - INFO - waiting on raft
2024-08-06 06:24:59,359 - INFO - waiting on raft
2024-08-06 06:25:04,360 - INFO - waiting on raft
^C
Aborted!
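If it helps with triage, the raft state on each remaining unit can in principle also be queried directly with pysyncobj's admin tool (pysyncobj is what Patroni's raft support is built on; whether the tool is available inside the snap is an assumption, and the address placeholder below should be taken from the raft section of patroni.yaml):

$ syncobj_admin -conn <raft_self_addr_from_patroni.yaml> -status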
The raft config in patroni.yaml (/var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml) looks okay though.
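For reference, the relevant section can be pulled out directly on a unit with something like the following (the -A 5 context size is an arbitrary guess at the section length):

$ sudo grep -A 5 '^raft:' /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml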
Versions
Operating system: jammy
Juju CLI: 3.5.3-genericlinux-amd64
Juju agent: 3.5.3
Charm revision: 14/stable 429
LXD: N/A
Log output
Juju debug log:
postgresql_replacing_failed_nodes_debug.log