
Principal => subordinate coordination for log rotation (grafana-agent holds many unreleased file descriptors, producing a huge syslog) #153

Open
taurus-forever opened this issue Jul 16, 2024 · 5 comments

Comments

@taurus-forever

Bug Description

Hi,

Please check the complete issue description in PostgreSQL repo:
canonical/postgresql-operator#524

TL;DR: the PostgreSQL charm rotates its logs but does not send any signal to the subordinated grafana-agent, which keeps descriptors to the rotated files open, causing a lot of unreleased files and a huge syslog => downtime.

It is a cross-team ticket to build a solution here.

To Reproduce

See steps to reproduce in canonical/postgresql-operator#524

Environment

See Versions in canonical/postgresql-operator#524

Relevant log output

No relevant log output.

Additional context

Proposals:

  • (preferred) subordinate charm(s) bring/install their own logrotate config, so the principal charm executes all of them.
  • the principal charm (somehow) detects subordinates and informs them (somehow) using a signal (SIGHUP?).
    Better ideas are welcome!
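
The first (preferred) proposal could look roughly like the snippet below: a drop-in file that a subordinate charm installs, which the principal charm's regular logrotate run would then pick up. The paths, rotation policy, and unit name are all assumptions for illustration, and the postrotate step assumes the agent reopens its files on SIGHUP:

```
# Hypothetical /etc/logrotate.d/grafana-agent drop-in installed by the
# subordinate charm; executed by the principal charm's logrotate run.
/var/log/postgresql/*.log {
    daily
    rotate 7
    missingok
    notifempty
    postrotate
        # Ask the agent to reopen descriptors to the rotated files
        # (assumes the agent handles SIGHUP this way).
        systemctl kill --signal=SIGHUP grafana-agent.service || true
    endscript
}
```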
@taurus-forever
Author

@simskij do you have good ideas on how to proceed here?
The DB charm needs to somehow be aware of the COS promtail in order to SIGHUP it properly.

@lucabello
Contributor

lucabello commented Aug 30, 2024

@taurus-forever Do you think this would be solved by not getting all the files from /var/log/** and instead only getting a few that you specify?

@taurus-forever
Author

taurus-forever commented Sep 6, 2024

@lucabello AFAIK, no. The grafana-agent charm / promtail binary reads some log files from disk (to send them to Loki). The PostgreSQL charm rotates logs but does not send any signal to COS to close the current descriptor and reopen the file (since the old one is moved to an archive folder).

IMHO, we have two options:

  • send all logs through Pebble and never read logs from disk. AFAIK this is the current COS plan. Cons: 1) VMs have no Pebble, and journald has performance issues, etc.; 2) some services (e.g. pgbackrest) require a complex file structure and do not support a syslog-style approach.
  • teach the principal/PostgreSQL charm to send signals to the subordinated grafana-agent on logrotate.
  • ...
    Better ideas are welcome!
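
The second option could be sketched as a postrotate hook on the principal side. Everything here is hypothetical: the process name "promtail" and the assumption that the tailer reopens its file descriptors on SIGHUP are not confirmed by grafana-agent's behaviour:

```python
import os
import signal

def pids_matching(pred):
    """Return PIDs of processes whose /proc/<pid>/comm matches pred (Linux only)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                comm = f.read().strip()
        except OSError:
            continue  # process exited between listdir and open
        if pred(comm):
            pids.append(int(entry))
    return pids

def sighup_log_tailer(name="promtail"):
    """Hypothetical postrotate hook: SIGHUP every process named `name`
    so it reopens descriptors to the freshly rotated log files."""
    for pid in pids_matching(lambda comm: comm == name):
        os.kill(pid, signal.SIGHUP)
```

This sidesteps Juju entirely, which is why the drop-in-logrotate-config proposal is probably cleaner: it keeps the coordination declarative instead of having the principal charm guess at subordinate process names.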

@sed-i
Contributor

sed-i commented Sep 6, 2024

  • Grafana folks recommend using the "rename and create" rotation strategy over "copy and truncate". Do you know what rotation strategy is set up for postgresql? Is that configurable on your side?
  • Are you saying that slurping the archive patroni/patroni.log.* is incorrect, and we should only be reading the one file patroni/patroni.log? That could be related to the glob "/var/log/**/*log" we have in place. But I'm surprised that filenames such as patroni.log.6545 or patroni.log.10080 would match the glob. I confirmed that Python's glob meets my expectation ([print(p.name) for p in Path(".").glob("**/*log")]), but I'm not sure how it's implemented in grafana-agent.

I'd appreciate your input on the above @taurus-forever.
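
The glob expectation above can be checked with a small self-contained helper. Note this only demonstrates Python's `pathlib.Path.glob`, as in the comment; promtail's Go-side globbing may behave differently:

```python
import tempfile
from pathlib import Path

def glob_matches(pattern, filenames):
    """Create the given files in a temp dir and return which names the pattern matches."""
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        for name in filenames:
            (root / name).touch()
        return sorted(p.name for p in root.glob(pattern))

# "*log" requires the name to END in "log", so numeric-suffixed
# archives like patroni.log.6545 should not match.
print(glob_matches("**/*log", ["patroni.log", "patroni.log.6545", "patroni.log.10080"]))
# → ['patroni.log']
```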

@dragomirp
Contributor

  • Grafana folks recommend using the "rename and create" rotation strategy over "copy and truncate". Do you know what rotation strategy is set up for postgresql? Is that configurable on your side?

Patroni should be using an extended version of Python's RotatingFileHandler. Python's docs indicate “rename and create”. We can only configure the size and the number of files to keep.

PostgreSQL is configured to keep a week's worth of per-minute logs; each file is truncated when its time slot comes around again. This behaviour is configured by us, but it follows the spec, so discussions would be necessary to change it.

Both Patroni and PostgreSQL try to keep about a week's worth of per-minute logs, so that should be about 10k files for each.
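
The rename-and-create behaviour of Python's stock `RotatingFileHandler` (which Patroni extends) can be verified with a quick self-contained check; the sizes and record counts below are made up for the demo:

```python
import logging.handlers
import tempfile
from pathlib import Path

def rotation_demo():
    """Force several rollovers and list which files survive."""
    d = Path(tempfile.mkdtemp())
    handler = logging.handlers.RotatingFileHandler(
        d / "patroni.log", maxBytes=100, backupCount=3
    )
    logger = logging.getLogger("rotation-demo")
    logger.propagate = False
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    for _ in range(20):
        logger.info("x" * 40)  # ~41 bytes per record, so a rollover every 2 records
    logger.removeHandler(handler)
    handler.close()
    # On rollover the handler *renames* patroni.log to patroni.log.1 (shifting
    # older backups up) and creates a fresh patroni.log; only backupCount
    # archives are kept, the rest are deleted.
    return sorted(p.name for p in d.iterdir())

print(rotation_demo())
# → ['patroni.log', 'patroni.log.1', 'patroni.log.2', 'patroni.log.3']
```

With rename-and-create, a tailer holding the old descriptor keeps reading the renamed archive rather than the new patroni.log, which is consistent with the unreleased-descriptor symptom in this issue.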

  • Are you saying that slurping the archive patroni/patroni.log.* is incorrect, and we should only be reading the one file patroni/patroni.log? That could be related to the glob "/var/log/**/*log" we have in place. But I'm surprised that filenames such as patroni.log.6545 or patroni.log.10080 would match the glob. I confirmed that Python's glob meets my expectation ([print(p.name) for p in Path(".").glob("**/*log")]), but I'm not sure how it's implemented in grafana-agent.

I don't know how the agent detects log changes, but for Patroni only the last few logs should be relevant; the deeper backlog was likely already synced and shouldn't be changing. For PostgreSQL things are trickier, since the filenames stay the same (postgresql-%w_%H%M.log) but each file is overwritten when the same time comes around again.
