
2/3 PG units stuck in waiting/idle state, not moving to active/idle #668

Open
ethanmye-rs opened this issue Nov 11, 2024 · 4 comments
Labels: bug (Something isn't working)

@ethanmye-rs (Member)

Steps to reproduce

a. I do not have a firm reproducer, but I ran into this issue when upgrading from rev 429 to rev 468 in a charmed Landscape deployment. I originally encountered the issue on rev 429 and, based on a prior bug, expected that refreshing to 468 would fix it. However, my PG units still do not start and sit in an "awaiting for member to start" state.
b. I did not encounter this issue on another cluster in an identical environment, so it seems somewhat random. The machines in the Juju model are manual machines in Azure.

  1. Essentially, 2/3 PostgreSQL units stay stuck in an "awaiting for member to start" state. They cycle through different waiting and executing states, but never actually start.

Expected behavior

I expect the other 2 units to start and enter an active/idle state. They have been in this state for >48 hours.

Actual behavior

[screenshot: juju status output showing 2/3 postgresql units stuck in waiting/idle]

See the logs below: the machines cycle through waiting/executing states but never enter active/idle as expected.

Versions

Operating system: Ubuntu 22.04.4

Juju CLI: 3.5.4

Juju agent: 3.5.4

Charm revision: 468

LXD: n/a

Log output

juju debug log: https://paste.ubuntu.com/p/FzXnjMpNYz/
snap logs from one unit failing to start: https://paste.ubuntu.com/p/St8WZNn4GT/ (restart at the end of the log file)
snap logs from other unit failing to start: https://paste.ubuntu.com/p/BH3RXfZrTW/
snap logs from healthy unit: https://paste.ubuntu.com/p/b6bgSVZKYm/
pg snap services config: https://paste.ubuntu.com/p/xJJq6ktXm9/

Happy to provide more logs, details or access to the environment. Thanks.

ethanmye-rs added the bug (Something isn't working) label on Nov 11, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5931.

This message was autogenerated

@dragomirp (Contributor)

Hi @ethanmye-rs, it seems like Patroni's health endpoint is failing, likely because of the "rejecting connections" errors in the snap logs. This in turn seems to be caused by Patroni's pg_isready call failing with exit code 1. Can you check whether you have any PostgreSQL logs on the failing units from around the time they failed? They should be located in /var/snap/charmed-postgresql/common/var/log/postgresql/. There may be more details in the Patroni logs located in /var/snap/charmed-postgresql/common/var/log/patroni/.
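For reference, a minimal way to pull those logs from a failing unit (postgresql/1 here is just a placeholder unit name):

juju ssh postgresql/1 sudo ls /var/snap/charmed-postgresql/common/var/log/postgresql/
juju ssh postgresql/1 sudo tail -n 200 /var/snap/charmed-postgresql/common/var/log/patroni/patroni.log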

@ethanmye-rs (Member, Author) commented Nov 12, 2024

Pulled the logs for the two failed units; they are located in Google Drive here. I'm not sure if there is anything sensitive in them (they are about 200 MB total, uncompressed), so I've made them Canonical-internal. I will also pull the logs for the active/idle unit, but they are much larger, about 2.4 GB uncompressed.

@ethanmye-rs (Member, Author) commented Nov 12, 2024

Thanks very much @marceloneppel for the help. The core issue was missing pg_wal data, which prevented the other two units from moving past the starting state. A few misc things:

The replica machines could not start because they were missing pg_wal data. At an earlier point Postgres had exhausted the 64 GB VM disk and I was forced to restart the machine, which is probably when the pg_wal data was lost (I believe PG clears pg_wal data on reboot). However, the main database is not very large, maybe 6-7 GB based on the data in /var/snap/charmed-postgresql/common/var/lib/postgresql/, so it is surprising that pg_wal grew so large. It would be nice to set a charm limit on pg_wal to avoid getting into this situation.
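As a rough sketch (assuming pg_wal sits directly under the data directory mentioned above, which is the standard layout), the WAL size on a unit can be checked with:

sudo du -sh /var/snap/charmed-postgresql/common/var/lib/postgresql/pg_wal

The PostgreSQL settings that bound WAL growth and retention are max_wal_size and wal_keep_size, so a charm-level limit would presumably go through those.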

Based on the logs, it seems that when upgrading from 429 -> 468 you will still see superfluous log entries for charmed-postgresql.pgbackrest-service and similar services failing to start.
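A quick way to see which of the snap's services are running on a unit, and their recent log lines, is the standard snapd tooling:

sudo snap services charmed-postgresql
sudo snap logs -n=50 charmed-postgresql.pgbackrest-service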

It would also be nice to surface a warning (either in Juju or COS) if one of the Patroni members is in anything other than a streaming or running state. Having machines stuck in the starting state surfaced no errors.

For future reference, this is how we checked the patroni status:

curl <current active pg unit ip>:8008/cluster

and for reinitializing the followers, to be run on one of the follower machines:

sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml reinit postgresql

You can query the state by looking at either the /cluster endpoint or by catting /var/snap/charmed-postgresql/common/var/log/patroni/patroni.log.
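Putting that together, a rough one-liner (assuming jq is available and the usual members[].state field in the Patroni /cluster response) to flag members that are neither streaming nor running:

curl -s <current active pg unit ip>:8008/cluster | jq '.members[] | select(.state != "streaming" and .state != "running") | {name, state}'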
