[Actionable Observability] Compute current service level indicator and error budget consumption #142521
Comments
Pinging @elastic/actionable-observability (Team: Actionable Observability)
I would prefer the former as well for the same reason (from a Frontend developer perspective). My only concern is using this with Ansible/Terraform, is the expectation that when you hit the
@vinaychandrasekhar What do you think about having burn rate stats with the "Google Recommended" breakdowns? It would look something like this:
I'm OK with
I vote for capping it, BUT this does make me wonder if we shouldn't have a way to indicate "You're over your budget by X amount"?
Offering some default burn rates in the API response makes sense. If users have explicitly set up alerts for specific burn rates, including that set of burn rates would be good. For the defaults themselves (5m/30m/...) we may need to consider the defined SLO and offer the defaults accordingly. (For the Google example, the recommended params are for a 99.9% SLO. There's a separate section on other availability goals - see "Extreme Availability Goals" here). I wonder if we need to come up with a mapping of SLO range to default burn rates. Or, start simple and show either whatever the user has already chosen for their SLO, or offer two burn rates based on SLO and duration. On capping, two considerations. First, readability: a consumption of 1000000% is not very readable or human usable (with or without comma separators). So capping (with a clear indication that it's a capped number) at a sufficiently high number makes sense, as long as there's a way for the user to get to the actual number if they need it. In terms of presentation, perhaps even switch over to "X times over" or some such. Second, charting and trending: when charting consumed budget over time (over multiple SLO periods) we'll need to think through how capping plays a role.
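The readability point above can be sketched as a small formatting helper. This is only an illustration: the function name `format_consumption`, the cap of 10x, and the exact wording of the messages are assumptions, not something decided in this thread. The raw, uncapped value is expected to stay available to the caller.

```python
def format_consumption(consumed: float, cap: float = 10.0) -> str:
    """Render error budget consumption for display.

    `consumed` is a ratio where 1.0 means 100% of the budget.
    `cap` is a hypothetical display cap; the raw value is left untouched.
    """
    if consumed <= 1.0:
        # Within budget: a plain percentage is readable.
        return f"{consumed:.0%} of error budget consumed"
    if consumed <= cap:
        # Over budget but below the cap: switch to "X times" wording.
        return f"{consumed:.1f}x the error budget consumed"
    # Far over budget: show the cap and flag clearly that it is capped.
    return f"more than {cap:.0f}x the error budget consumed (capped)"
```

Capping only at the presentation layer leaves the question of charting over multiple SLO periods open, since the stored value is never modified.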
Thanks for the feedback @vinaychandrasekhar @simianhacker. I agree with you in general; I would just clarify that the recommended burn rate values relate to alerts, not to the SLO per se.
Agree, we should at least guide the user towards good values when defining Burn Rate alerts, and prohibit the user from using a burn rate value that cannot be reached. 👇🏻 Some examples: Also, regarding the recommended values for burn rate alerts, I would suggest having them somewhere on the alerts API, not the SLO API. They are not useful when looking at the SLO, only when defining/editing a Burn Rate Alert on an SLO.
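To make "a burn rate value that cannot be reached" concrete: burn rate is the observed error rate divided by the error budget (1 - target), so the highest possible burn rate occurs when every event is bad. A minimal sketch of such a validation follows; the function names are hypothetical and do not come from the alerts API.

```python
def burn_rate(error_rate: float, target: float) -> float:
    """Burn rate = observed error rate / error budget (1 - target)."""
    return error_rate / (1.0 - target)


def max_burn_rate(target: float) -> float:
    """Highest reachable burn rate: a 100% error rate over the budget."""
    return burn_rate(1.0, target)


def validate_threshold(threshold: float, target: float) -> None:
    """Reject burn rate thresholds that can never fire for this target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if threshold <= 0.0:
        raise ValueError("burn rate threshold must be positive")
    limit = max_burn_rate(target)
    if threshold > limit:
        raise ValueError(
            f"threshold {threshold} exceeds the maximum burn rate "
            f"{limit:.1f} for a target of {target}"
        )
```

For example, a threshold of 14.4 is reachable for a 99.9% target (maximum around 1000) but not for a 90% target (maximum around 10).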
Agree, 1000000% is not readable 🙈 What about not capping the error budget consumption, e.g. we return a number between 0 (0%) and +infinity (technically, meaning (+infinity) * 100%). For example: But capping the error budget remaining, e.g. a number from 0 to 1, meaning 0% left and 100% left respectively. Allowing a negative number for the remaining budget doesn't mean anything really, e.g. "you have -999% left" 🤷🏻 . You either have all or some budget left, or none.
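The proposal above (uncapped consumption, remaining budget clamped to the range 0 to 1) could look roughly like this. The helper name `budget_remaining` is made up for illustration.

```python
def budget_remaining(consumed: float) -> float:
    """Remaining error budget as a ratio clamped to [0, 1].

    `consumed` is the uncapped consumption ratio (1.0 == 100%), so it may
    exceed 1.0; the remaining budget never goes below zero.
    """
    return min(1.0, max(0.0, 1.0 - consumed))


print(budget_remaining(0.25))  # some budget left
print(budget_remaining(3.5))   # over budget: nothing left, never negative
```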
This is like a history of the SLI value and error budget over time. I would suggest having this behind a separate API, e.g.
📝 Summary
Part of #137323
Detailed algebra
For a given SLO, we want to compute the following information:
With the above defined data, a user can understand if an SLO is currently met or not, how much of an error budget is consumed in percentage, e.g. "50% of your error budget has been consumed"
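As a rough sketch of the computation described here, assuming an SLI defined as the ratio of good events to total events: the `SloSummary` type, its field names, and the choice to treat an empty window as a perfect SLI are all assumptions made for illustration, not the shape of the actual response.

```python
from dataclasses import dataclass


@dataclass
class SloSummary:
    sli_value: float              # good / total, between 0 and 1
    error_budget_initial: float   # 1 - target
    error_budget_consumed: float  # ratio of the budget used (1.0 == 100%)
    objective_met: bool


def compute_summary(good: int, total: int, target: float) -> SloSummary:
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    # Assumption: with no events in the window, treat the SLI as perfect.
    sli = good / total if total > 0 else 1.0
    initial = 1.0 - target
    # Share of the allowed errors that has actually occurred.
    consumed = (1.0 - sli) / initial
    return SloSummary(sli, initial, consumed, sli >= target)


summary = compute_summary(good=995, total=1000, target=0.99)
print(f"{summary.error_budget_consumed:.0%} of your error budget has been consumed")
```

With 995 good events out of 1000 and a 99% target, half of the allowed errors have occurred, which matches the "50% of your error budget has been consumed" wording above.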
Example (stats) of the GET /slos/id
❓ Questions
Should we return this from the GET /slos/{id} route or create a sub route (summary)? I tend to prefer the former approach to avoid a second request from the frontend.
✅ Acceptance Criteria