
Principal => subordinate coordination for log rotation (grafana-agent holds many unreleased file descriptors, producing a huge syslog) #153

Open
taurus-forever opened this issue Jul 16, 2024 · 5 comments

Comments

@taurus-forever

Bug Description

Hi,

Please check the complete issue description in PostgreSQL repo:
canonical/postgresql-operator#524

TL;DR: the PostgreSQL charm rotates its logs but does not send any signal to the subordinated grafana-agent, which keeps descriptors to the rotated files open, causing a lot of unreleased files and a huge syslog => downtime.

It is a cross-team ticket to build a solution here.

To Reproduce

See steps to reproduce in canonical/postgresql-operator#524

Environment

See Versions in canonical/postgresql-operator#524

Relevant log output

No relevant log output.

Additional context

Proposals:

  • (preferred) subordinate charm(s) bring/install their own logrotate config, so the principal charm executes all of them.
  • the principal charm (somehow) detects subordinates and informs them (somehow) using a signal (SIGHUP?).
    Better ideas are welcome!
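
The first (preferred) proposal could look roughly like the snippet below: a drop-in file that a subordinate charm installs, which the principal charm's regular logrotate run would then pick up. The paths, rotation policy, and unit name are all assumptions for illustration, and the postrotate step assumes the agent reopens its files on SIGHUP:

```
# Hypothetical /etc/logrotate.d/grafana-agent drop-in installed by the
# subordinate charm; executed by the principal charm's logrotate run.
/var/log/postgresql/*.log {
    daily
    rotate 7
    missingok
    notifempty
    postrotate
        # Ask the agent to reopen descriptors to the rotated files
        # (assumes the agent handles SIGHUP this way).
        systemctl kill --signal=SIGHUP grafana-agent.service || true
    endscript
}
```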
@taurus-forever
Author

@simskij do you have good ideas on how to proceed here?
The DB charm needs to somehow be aware of the COS promtail in order to SIGHUP it properly.

@lucabello
Contributor

lucabello commented Aug 30, 2024

@taurus-forever Do you think this would be solved by not getting all the files from /var/log/** and instead only getting a few that you specify?

@taurus-forever
Author

taurus-forever commented Sep 6, 2024

@lucabello AFAIK, no. The grafana-agent charm / promtail binary reads some log files from disk (to send them to Loki). The PostgreSQL charm rotates logs but does not send any signal to COS to close the current descriptor and reopen the file (since the old one is moved to an archive folder).

IMHO, we have two options:

  • send all logs through Pebble and never read logs from disk. AFAIK this is the current COS plan. Cons: 1) VMs have no Pebble, and journald has performance issues, etc.; 2) some services (e.g. pgbackrest) require a complex file structure and do not support a syslog-style approach.
  • teach the principal/PostgreSQL charm to send signals to the subordinated grafana-agent on logrotate.
  • ...
    Better ideas are welcome!
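
The second option could be sketched as a postrotate hook on the principal side. Everything here is hypothetical: the process name "promtail" and the assumption that the tailer reopens its file descriptors on SIGHUP are not confirmed by grafana-agent's behaviour:

```python
import os
import signal

def pids_matching(pred):
    """Return PIDs of processes whose /proc/<pid>/comm matches pred (Linux only)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                comm = f.read().strip()
        except OSError:
            continue  # process exited between listdir and open
        if pred(comm):
            pids.append(int(entry))
    return pids

def sighup_log_tailer(name="promtail"):
    """Hypothetical postrotate hook: SIGHUP every process named `name`
    so it reopens descriptors to the freshly rotated log files."""
    for pid in pids_matching(lambda comm: comm == name):
        os.kill(pid, signal.SIGHUP)
```

This sidesteps Juju entirely, which is why the drop-in-logrotate-config proposal is probably cleaner: it keeps the coordination declarative instead of having the principal charm guess at subordinate process names.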

@sed-i
Contributor

sed-i commented Sep 6, 2024

  • Grafana folks recommend using the "rename and create" rotation strategy over "copy and truncate". Do you know what rotation strategy is set up for postgresql? Is that configurable on your side?
  • Are you saying that slurping the archive patroni/patroni.log.* is incorrect, and we should only be reading the one file patroni/patroni.log? That could be related to the glob "/var/log/**/*log" we have in place. But I'm surprised that filenames such as patroni.log.6545 or patroni.log.10080 would match the glob. I confirmed that Python's glob meets my expectation ([print(p.name) for p in Path(".").glob("**/*log")]), but I'm not sure how it's implemented in grafana-agent.

I'd appreciate your input on the above @taurus-forever.
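
The glob expectation above can be checked with a small self-contained helper. Note this only demonstrates Python's `pathlib.Path.glob`, as in the comment; promtail's Go-side globbing may behave differently:

```python
import tempfile
from pathlib import Path

def glob_matches(pattern, filenames):
    """Create the given files in a temp dir and return which names the pattern matches."""
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        for name in filenames:
            (root / name).touch()
        return sorted(p.name for p in root.glob(pattern))

# "*log" requires the name to END in "log", so numeric-suffixed
# archives like patroni.log.6545 should not match.
print(glob_matches("**/*log", ["patroni.log", "patroni.log.6545", "patroni.log.10080"]))
# → ['patroni.log']
```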

@dragomirp
Contributor

  • Grafana folks recommend using the "rename and create" rotation strategy over "copy and truncate". Do you know what rotation strategy is set up for postgresql? Is that configurable on your side?

Patroni should be using an extended version of Python's RotatingFileHandler. Python's docs indicate “rename and create”. We can only configure the size and the number of files to keep.

PostgreSQL is configured to keep a week's worth of per-minute logs; each file is truncated when its time slot comes around again. This behaviour is configured by us, but it follows the spec, so discussions would be necessary to change it.

Both Patroni and PostgreSQL try to keep about a week's worth of per-minute logs, so that should be about 10k files for each.
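
The rename-and-create behaviour of Python's stock `RotatingFileHandler` (which Patroni extends) can be verified with a quick self-contained check; the sizes and record counts below are made up for the demo:

```python
import logging.handlers
import tempfile
from pathlib import Path

def rotation_demo():
    """Force several rollovers and list which files survive."""
    d = Path(tempfile.mkdtemp())
    handler = logging.handlers.RotatingFileHandler(
        d / "patroni.log", maxBytes=100, backupCount=3
    )
    logger = logging.getLogger("rotation-demo")
    logger.propagate = False
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    for _ in range(20):
        logger.info("x" * 40)  # ~41 bytes per record, so a rollover every 2 records
    logger.removeHandler(handler)
    handler.close()
    # On rollover the handler *renames* patroni.log to patroni.log.1 (shifting
    # older backups up) and creates a fresh patroni.log; only backupCount
    # archives are kept, the rest are deleted.
    return sorted(p.name for p in d.iterdir())

print(rotation_demo())
# → ['patroni.log', 'patroni.log.1', 'patroni.log.2', 'patroni.log.3']
```

With rename-and-create, a tailer holding the old descriptor keeps reading the renamed archive rather than the new patroni.log, which is consistent with the unreleased-descriptor symptom in this issue.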

  • Are you saying that slurping the archive patroni/patroni.log.* is incorrect, and we should only be reading the one file patroni/patroni.log? That could be related to the glob "/var/log/**/*log" we have in place. But I'm surprised that filenames such as patroni.log.6545 or patroni.log.10080 would match the glob. I confirmed that Python's glob meets my expectation ([print(p.name) for p in Path(".").glob("**/*log")]), but I'm not sure how it's implemented in grafana-agent.

I don't know how the agent detects log changes, but for Patroni only the last few logs should be relevant; the deeper backlog was likely already synced and shouldn't be changing. For PostgreSQL things are trickier, since the filenames stay the same (postgresql-%w_%H%M.log) but each file is overwritten when the same time comes around again.
