
2/3 PG units stuck in waiting/idle state, not moving to active/idle #668

Open
ethanmye-rs opened this issue Nov 11, 2024 · 4 comments
Labels: bug (Something isn't working)

@ethanmye-rs (Member)

Steps to reproduce

a. I do not have a firm reproducer, but I ran into this issue when upgrading from rev 429 to rev 468 in a charmed Landscape deployment. I originally encountered the issue on rev 429 and, based on a prior bug, expected that refreshing to 468 would fix it. However, my PG units still do not start and sit in an "awaiting for member to start" state.
b. I did not encounter this issue on another cluster in an identical environment, so it seems somewhat random. The machines in the Juju model are manual machines in Azure.

  1. Essentially, 2/3 PostgreSQL units stay stuck in an "awaiting for member to start" state. They cycle through different waiting and executing states, but never actually start.

Expected behavior

I expect the other 2 units to start and enter an active/idle state. They have been in this state for >48 hours.

Actual behavior

[screenshot: juju status output showing 2/3 postgresql units stuck in waiting/idle]

See the logs below: the machines cycle through waiting/executing states but never enter active/idle as expected.

Versions

Operating system: Ubuntu 22.04.4

Juju CLI: 3.5.4

Juju agent: 3.5.4

Charm revision: 468

LXD: n/a

Log output

juju debug log: https://paste.ubuntu.com/p/FzXnjMpNYz/
snap logs from one unit failing to start: https://paste.ubuntu.com/p/St8WZNn4GT/ (restart at the end of the log file)
snap logs from other unit failing to start: https://paste.ubuntu.com/p/BH3RXfZrTW/
snap logs from healthy unit: https://paste.ubuntu.com/p/b6bgSVZKYm/
pg snap services config: https://paste.ubuntu.com/p/xJJq6ktXm9/

Happy to provide more logs, details or access to the environment. Thanks.

ethanmye-rs added the bug (Something isn't working) label on Nov 11, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5931.

This message was autogenerated

@dragomirp (Contributor)

Hi @ethanmye-rs, it seems like Patroni's health endpoint is failing, likely because of the "rejecting connections" errors in the snap logs. This in turn seems to be caused by Patroni's pg_isready call failing with exit code 1. Can you check whether you have any PostgreSQL logs on the failing units from around the time they failed? They should be located in /var/snap/charmed-postgresql/common/var/log/postgresql/. There may be more details in the Patroni logs located in /var/snap/charmed-postgresql/common/var/log/patroni/.
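For reference, a minimal way to pull those logs from a failing unit (postgresql/1 here is just a placeholder unit name):

juju ssh postgresql/1 sudo ls /var/snap/charmed-postgresql/common/var/log/postgresql/
juju ssh postgresql/1 sudo tail -n 200 /var/snap/charmed-postgresql/common/var/log/patroni/patroni.log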

@ethanmye-rs (Member, Author) commented Nov 12, 2024

Pulled the logs for the two failed units; they are located in Google Drive here. I'm not sure if there is anything sensitive in them (they are about 200 MB total, uncompressed), so I've made them Canonical-internal. I will also pull the logs for the active/idle unit, but they are much larger, about 2.4 GB uncompressed.

@ethanmye-rs (Member, Author) commented Nov 12, 2024

Thanks very much @marceloneppel for the help. The core issue was missing pg_wal data, which prevented the other two units from moving past the starting state. A few misc things:

The replica machines could not start because they were missing pg_wal data. At an earlier point Postgres had exhausted the 64 GB VM disk and I was forced to restart the machine, which is probably when the pg_wal data was lost (I believe PG clears pg_wal data on reboot). However, the main database is not very large, maybe 6-7 GB based on the data in /var/snap/charmed-postgresql/common/var/lib/postgresql/, so it is surprising that pg_wal grew so large. It would be nice to set a charm limit on pg_wal to avoid getting into this situation.
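As a rough sketch (assuming pg_wal sits directly under the data directory mentioned above, which is the standard layout), the WAL size on a unit can be checked with:

sudo du -sh /var/snap/charmed-postgresql/common/var/lib/postgresql/pg_wal

The PostgreSQL settings that bound WAL growth and retention are max_wal_size and wal_keep_size, so a charm-level limit would presumably go through those.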

Based on the logs, it seems that when upgrading from 429 -> 468 you will still see superfluous log entries for charmed-postgresql.pgbackrest-service and similar services failing to start.
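A quick way to see which of the snap's services are running on a unit, and their recent log lines, is the standard snapd tooling:

sudo snap services charmed-postgresql
sudo snap logs -n=50 charmed-postgresql.pgbackrest-service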

It would also be nice to surface a warning (either in Juju or COS) if one of the Patroni members is in anything other than a streaming or running state. Having machines stuck in the starting state surfaced no errors.

For future reference, this is how we checked the patroni status:

curl <current active pg unit ip>:8008/cluster

and for reinitializing the followers, to be run on one of the follower machines:

sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml reinit postgresql

You can query the state by looking at either the /cluster endpoint or by catting /var/snap/charmed-postgresql/common/var/log/patroni/patroni.log.
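Putting that together, a rough one-liner (assuming jq is available and the usual members[].state field in the Patroni /cluster response) to flag members that are neither streaming nor running:

curl -s <current active pg unit ip>:8008/cluster | jq '.members[] | select(.state != "streaming" and .state != "running") | {name, state}'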
