
The cluster stops functioning after replacing a failed unit then another unit failed #573

Open
nobuto-m opened this issue Aug 6, 2024 · 2 comments
Labels
bug Something isn't working

Comments

nobuto-m commented Aug 6, 2024

Steps to reproduce

  1. prepare a MAAS provider
  2. deploy a 3-node cluster
    juju deploy postgresql --base ubuntu@22.04 --channel 14/stable -n 3
  3. take down one unit by simulating a hardware failure
  4. remove the failed unit from the model
  5. add another unit to have 3 nodes again
  6. take down another unit by simulating a hardware failure

Expected behavior

The cluster keeps working, since two of the three nodes are still alive (the Raft quorum should be satisfied).
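For reference, a Raft cluster stays available as long as a strict majority of its members is reachable. A minimal sketch of that arithmetic (illustration only, not part of the charm's code):

```python
# Raft needs a strict majority of the known cluster members to elect a
# leader and keep making progress.
def has_quorum(alive: int, cluster_size: int) -> bool:
    return alive > cluster_size // 2

print(has_quorum(2, 3))  # True: 2 of 3 nodes alive is a majority
print(has_quorum(1, 3))  # False: 1 of 3 is not
```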

Actual behavior

The cluster is not operational.

$ sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
2024-08-06 06:12:20,353 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-06 06:12:25,425 - INFO - waiting on raft
2024-08-06 06:12:30,425 - INFO - waiting on raft
2024-08-06 06:12:35,426 - INFO - waiting on raft
2024-08-06 06:12:40,426 - INFO - waiting on raft
^C
Aborted!

The charm reports postgresql/0 as the primary, but that is no longer true: there is no postgresql process running on postgresql/0.

$ juju status
Model     Controller       Cloud/Region  Version  SLA          Timestamp
postgres  maas-controller  maas/default  3.5.3    unsupported  06:17:39Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active    2/3  postgresql  14/stable  429  no       

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.112  5432/tcp  Primary
postgresql/1   active    idle   1        192.168.151.113  5432/tcp  
postgresql/3   unknown   lost   3        192.168.151.115  5432/tcp  agent lost, see 'juju show-status-log postgresql/3'

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.112  machine-2  ubuntu@22.04  default  Deployed
1        started  192.168.151.113  machine-3  ubuntu@22.04  default  Deployed
3        down     192.168.151.115  machine-4  ubuntu@22.04  default  Deployed
$ juju run postgresql/leader get-primary
Running operation 7 with 1 task
  - task 8 on unit-postgresql-0

Waiting for task 8...
primary: postgresql/0
$ juju ssh postgresql/0 -- pgrep -af postgres
7638 /snap/charmed-postgresql/115/usr/bin/prometheus-postgres-exporter
36668 python3 /snap/charmed-postgresql/115/usr/bin/patroni /var/snap/charmed-postgresql/115/etc/patroni/patroni.yaml
36920 /usr/bin/python3 src/cluster_topology_observer.py http://192.168.151.112:8008 True /usr/bin/juju-exec postgresql/0 /var/lib/juju/agents/unit-postgresql-0/charm
37263 snap restart charmed-postgresql.patroni
37272 systemctl stop snap.charmed-postgresql.patroni.service
Connection to 192.168.151.112 closed.

^^^ no postgresql process.

initial status

$ juju status
Model     Controller       Cloud/Region  Version  SLA          Timestamp
postgres  maas-controller  maas/default  3.5.3    unsupported  05:40:11Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active      3  postgresql  14/stable  429  no       

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.112  5432/tcp  Primary
postgresql/1   active    idle   1        192.168.151.113  5432/tcp  
postgresql/2   active    idle   2        192.168.151.114  5432/tcp  

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.112  machine-2  ubuntu@22.04  default  Deployed
1        started  192.168.151.113  machine-3  ubuntu@22.04  default  Deployed
2        started  192.168.151.114  machine-4  ubuntu@22.04  default  Deployed
$ sudo -u snap_daemon patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
+ Cluster: postgresql (7399889811388793708) ------+-----------+----+-----------+
| Member         | Host            | Role         | State     | TL | Lag in MB |
+----------------+-----------------+--------------+-----------+----+-----------+
| postgresql-0   | 192.168.151.112 | Leader       | running   |  1 |           |
| + postgresql-1 | 192.168.151.113 | Sync Standby | streaming |  1 |         0 |
| + postgresql-2 | 192.168.151.114 | Replica      | streaming |  1 |         0 |
+----------------+-----------------+--------------+-----------+----+-----------+

after taking down postgresql-2 (non leader)

$ juju status
Model     Controller       Cloud/Region  Version  SLA          Timestamp
postgres  maas-controller  maas/default  3.5.3    unsupported  05:47:59Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active    2/3  postgresql  14/stable  429  no       

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.112  5432/tcp  Primary
postgresql/1   active    idle   1        192.168.151.113  5432/tcp  
postgresql/2   unknown   lost   2        192.168.151.114  5432/tcp  agent lost, see 'juju show-status-log postgresql/2'

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.112  machine-2  ubuntu@22.04  default  Deployed
1        started  192.168.151.113  machine-3  ubuntu@22.04  default  Deployed
2        down     192.168.151.114  machine-4  ubuntu@22.04  default  Deployed
$ sudo -u snap_daemon patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
+ Cluster: postgresql (7399889811388793708) ------+-----------+----+-----------+
| Member         | Host            | Role         | State     | TL | Lag in MB |
+----------------+-----------------+--------------+-----------+----+-----------+
| postgresql-0   | 192.168.151.112 | Leader       | running   |  1 |           |
| + postgresql-1 | 192.168.151.113 | Sync Standby | streaming |  1 |         0 |
+----------------+-----------------+--------------+-----------+----+-----------+

-> expected status

cleaning up postgresql/2 from the model

remove-machine --force was used instead of remove-unit since the machine/unit agent is no longer responding after the hardware failure.

$ juju remove-machine --force 2
WARNING This command will perform the following actions:
will remove machine 2
- will remove unit postgresql/2
- will remove storage pgdata/2

Continue [y/N]? y
$ juju status
Model     Controller       Cloud/Region  Version  SLA          Timestamp
postgres  maas-controller  maas/default  3.5.3    unsupported  05:49:43Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active      2  postgresql  14/stable  429  no       

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.112  5432/tcp  Primary
postgresql/1   active    idle   1        192.168.151.113  5432/tcp  

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.112  machine-2  ubuntu@22.04  default  Deployed
1        started  192.168.151.113  machine-3  ubuntu@22.04  default  Deployed

adding another machine as the 3rd node (postgresql/3) in the cluster

$ juju add-unit postgresql
$ juju status
Model     Controller       Cloud/Region  Version  SLA          Timestamp
postgres  maas-controller  maas/default  3.5.3    unsupported  05:58:54Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active      3  postgresql  14/stable  429  no       

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.112  5432/tcp  Primary
postgresql/1   active    idle   1        192.168.151.113  5432/tcp  
postgresql/3   active    idle   3        192.168.151.115  5432/tcp  

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.112  machine-2  ubuntu@22.04  default  Deployed
1        started  192.168.151.113  machine-3  ubuntu@22.04  default  Deployed
3        started  192.168.151.115  machine-4  ubuntu@22.04  default  Deployed
$ sudo -u snap_daemon patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
+ Cluster: postgresql (7399889811388793708) ------+-----------+----+-----------+
| Member         | Host            | Role         | State     | TL | Lag in MB |
+----------------+-----------------+--------------+-----------+----+-----------+
| postgresql-0   | 192.168.151.112 | Leader       | running   |  1 |           |
| + postgresql-1 | 192.168.151.113 | Sync Standby | streaming |  1 |         0 |
| + postgresql-3 | 192.168.151.115 | Replica      | streaming |  1 |         0 |
+----------------+-----------------+--------------+-----------+----+-----------+

-> expected status

taking down postgresql/3

$ juju status
Model     Controller       Cloud/Region  Version  SLA          Timestamp
postgres  maas-controller  maas/default  3.5.3    unsupported  06:18:48Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active    2/3  postgresql  14/stable  429  no       

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.112  5432/tcp  Primary
postgresql/1   active    idle   1        192.168.151.113  5432/tcp  
postgresql/3   unknown   lost   3        192.168.151.115  5432/tcp  agent lost, see 'juju show-status-log postgresql/3'

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.112  machine-2  ubuntu@22.04  default  Deployed
1        started  192.168.151.113  machine-3  ubuntu@22.04  default  Deployed
3        down     192.168.151.115  machine-4  ubuntu@22.04  default  Deployed

The cluster should still work at this point, since two of the three nodes are alive. However, no Patroni operation is possible any longer.

$ sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
2024-08-06 06:24:29,290 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-06 06:24:34,357 - INFO - waiting on raft
2024-08-06 06:24:39,358 - INFO - waiting on raft
2024-08-06 06:24:44,358 - INFO - waiting on raft
2024-08-06 06:24:49,359 - INFO - waiting on raft
2024-08-06 06:24:54,359 - INFO - waiting on raft
2024-08-06 06:24:59,359 - INFO - waiting on raft
2024-08-06 06:25:04,360 - INFO - waiting on raft
^C
Aborted!

[/var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml]

raft:
  data_dir: /var/snap/charmed-postgresql/current/etc/patroni/raft
  self_addr: '192.168.151.113:2222'
  partner_addrs:
  - 192.168.151.115:2222
  - 192.168.151.112:2222

The raft config in patroni.yaml looks okay though.
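One hedged guess at the failure mode (consistent with this being the same pySyncObj issue referenced in #571 and #418): pySyncObj persists cluster membership in its raft journal under data_dir, so if the removed 192.168.151.114 member were still recorded there, the effective cluster would have four members and the two live nodes would no longer form a majority. The sketch below only illustrates that quorum arithmetic; the member sets and the "stale journal entry" are assumptions, not observed state:

```python
# Illustration only: quorum arithmetic if a removed member is still counted.
def has_quorum(alive: int, cluster_size: int) -> bool:
    # Raft requires a strict majority of known members.
    return alive > cluster_size // 2

# Members listed in patroni.yaml (self_addr + partner_addrs).
members_in_yaml = {"192.168.151.112", "192.168.151.113", "192.168.151.115"}
# Hypothetical stale entry assumed left behind in the persisted raft journal.
stale_member = "192.168.151.114"
effective_members = members_in_yaml | {stale_member}

alive = 2  # postgresql/0 and postgresql/1, after postgresql/3 went down

print(has_quorum(alive, len(members_in_yaml)))    # True:  2 of 3 would suffice
print(has_quorum(alive, len(effective_members)))  # False: 2 of 4 is no majority
```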

Versions

Operating system: jammy

Juju CLI: 3.5.3-genericlinux-amd64

Juju agent: 3.5.3

Charm revision: 14/stable 429

LXD: N/A

Log output

Juju debug log:

postgresql_replacing_failed_nodes_debug.log

Additional context


taurus-forever (Contributor) commented:
It is the same pySyncObj Raft library issue as described in #571 (comment).

Duplicate of #418, we are trying to fix this in https://warthogs.atlassian.net/browse/DPE-3684
