-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(operator): Restructure LokiStack metrics #12228
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I wonder, if it would be a good idea to have a metric exposing the state of all conditions on the LokiStack (similar to what kube_state_metrics does for the standard types) instead of just having a metric for the warnings. That way, we (or customers) could also produce alerts on the other conditions and, for example, automatically warn when a LokiStack fails to become ready for a while. |
What this PR does / why we need it:
The current implementation of the LokiStack metrics emitted by Loki Operator sometimes keeps old versions of the metrics. This leads to the metrics not properly reflecting the current state of the deployed LokiStack resources.
This PR replaces the current implementation which relied on being called by the reconciliation loop by an implementation that fetches the current information of the LokiStack resources by itself and generates metrics from that data. As the list of LokiStack resources should already be cached in the informer this should not put any additional stress on the Kubernetes API server.
Which issue(s) this PR fixes:
LOG-5253
Special notes for your reviewer:
lokistack_info
containing information about the deployed LokiStack resources as labels (currently limited to the "size")lokistack_status_condition
showing the current state of the Conditions available in the LokiStack status. This setup mirrors the metrics provided bykube_state_metrics
List
on each metrics scrape seems wasteful, but my theory is that this data should be available in the informer cache of the operator. I had a look at the apiserver metrics while running this operator version compared to a version without this change and it does look like this is true and the change does not have any effect on the apiserver.Checklist
CONTRIBUTING.md
guide (required)CHANGELOG.md
updated