[Actionable Observability] Compute current service level indicator and error budget consumption #142521

Closed
2 tasks
Tracked by #137323
kdelemme opened this issue Oct 3, 2022 · 4 comments · Fixed by #142784
Assignees
Labels
Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" v8.6.0

Comments

@kdelemme
Contributor

kdelemme commented Oct 3, 2022

📝 Summary

Part of #137323
Detailed algebra

For a given SLO, we want to compute the following information:

  • Current SLI value, e.g. Good / Total events over SLO time window
  • SLO Error Budget, e.g. 5% error budget for SLO target of 95%
  • Error Budget Consumed, e.g. 10% consumed of the error budget
  • Error Budget Left, e.g. 90% left on the error budget

With this data, a user can tell whether an SLO is currently met and what percentage of its error budget has been consumed, e.g. "50% of your error budget has been consumed".
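The four values listed above can be derived from the good and total event counts plus the SLO target. The following is a minimal sketch, not the actual implementation: the names `compute_summary`, `Summary` and `ErrorBudget` are hypothetical, the formula for `consumed` (error rate divided by the initial budget) is an assumption that happens to match the numbers in the example response, and rejecting a window with no events is an arbitrary choice made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    initial: float    # 1 - target, e.g. 0.05 for a 95% SLO
    consumed: float   # fraction of the initial budget already used
    remaining: float  # fraction of the initial budget still available


@dataclass(frozen=True)
class Summary:
    sli_value: float
    error_budget: ErrorBudget


def compute_summary(good: int, total: int, target: float) -> Summary:
    """Derive the SLI and error budget figures for an occurrences-based SLO."""
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    if total <= 0:
        # Assumption: the SLI is undefined without events, so the sketch refuses.
        raise ValueError("total must be positive")

    sli_value = good / total
    initial = 1 - target
    consumed = (1 - sli_value) / initial
    remaining = 1 - consumed  # left uncapped here; capping is an open question

    return Summary(
        sli_value=round(sli_value, 6),
        error_budget=ErrorBudget(
            initial=round(initial, 6),
            consumed=round(consumed, 6),
            remaining=round(remaining, 6),
        ),
    )
```

For instance, 950 good events out of 1000 against a 90% target would use half of the budget.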

Example response (stats) of GET /slos/{id}:

{
	"id": "fb7bea40-43e4-11ed-aa34-af16f78c8a81",
	"name": "My SLO Availability",
	"description": "99% o11y-app all services availability",
	"indicator": {
		"type": "slo.apm.transaction_error_rate",
		"params": {
			"environment": "development",
			"service": "o11y-app",
			"transaction_type": "request",
			"transaction_name": "GET /flaky",
			"good_status_codes": [
				"2xx",
				"3xx",
				"4xx"
			]
		}
	},
	"time_window": {
		"duration": "7d",
		"is_rolling": true
	},
	"budgeting_method": "occurrences",
	"objective": {
		"target": 0.95
	},
	"summary": {
		"sli_value": 0.999227,
		"error_budget": {
			"initial": 0.05,
			"consumed": 0.015452,
			"remaining": 0.984548
		}
	},
	"revision": 1,
	"created_at": "2022-10-04T13:03:46.916Z",
	"updated_at": "2022-10-04T13:03:46.916Z"
}

❓Questions

  1. Should we include this information on the GET /slos/{id} route or create a sub route? I tend to prefer the former approach to avoid a second request from the frontend.
  2. Does the data make sense? Are we missing something else?
  3. Can we find a better name for summary?
  4. Do we want to cap the error budget consumed to 100% and the error budget left to 0%? e.g. a 99% SLO with 0 good and 100 total events would give a consumption of 10,000% of the error budget (100 times the error budget) if not capped
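The figure in question 4 can be checked by hand, assuming consumption is defined as the observed error rate divided by the initial error budget (an assumption, although it is consistent with the example response above):

```math
\text{consumed} = \frac{1 - \text{SLI}}{1 - \text{target}} = \frac{1 - \frac{0}{100}}{1 - 0.99} = \frac{1}{0.01} = 100 = 10{,}000\%
```

Under the same assumption, the uncapped remaining budget would be 1 - 100 = -99, i.e. -9,900%.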

✅ Acceptance Criteria

  • Fetching an SLO returns the current sli value for the slo time window
  • Fetching an SLO returns the current error budget consumption details (initial error budget relative to the slo target, consumed and left relative to the initial error budget)
@kdelemme kdelemme self-assigned this Oct 3, 2022
@kdelemme kdelemme added the v8.6.0 label Oct 3, 2022
@botelastic botelastic bot added the needs-team Issues missing a team label label Oct 3, 2022
@kdelemme kdelemme added Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" and removed needs-team Issues missing a team label labels Oct 3, 2022
@elasticmachine
Contributor

Pinging @elastic/actionable-observability (Team: Actionable Observability)

@kdelemme kdelemme changed the title [Actionable Observability] Compute SLO current stats [Actionable Observability] Compute current service level indicator and error budget consumption Oct 4, 2022
@simianhacker
Member

Should we include this information on the GET /slos/{id} route or create a sub route? I tend to prefer the former approach to avoid a second request from the frontend.

I would prefer the former as well, for the same reason (from a frontend developer's perspective). My only concern is using this with Ansible/Terraform: is the expectation that GET /api/slo/{id} returns only the definition created through the create API, or is it OK to also include some stats? We could always add an optional URL argument, like GET /api/slo/{id}?withStats, which would include a stats section.
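To make the optional-argument idea concrete, here is a rough sketch of how a handler could keep the definition untouched and attach stats only on request. Everything in it is hypothetical (`build_slo_response`, the `with_stats` flag, the injected `compute_stats` callable); it only illustrates the shape of the trade-off, not an existing API.

```python
from typing import Any, Callable, Dict

SLO = Dict[str, Any]


def build_slo_response(
    definition: SLO,
    with_stats: bool,
    compute_stats: Callable[[SLO], Dict[str, Any]],
) -> SLO:
    """Return the stored definition, plus a "stats" section only when asked."""
    response = dict(definition)  # shallow copy so the stored definition is not mutated
    if with_stats:
        # Stats are computed lazily, so declarative tools that only need the
        # definition never pay for (or see) the extra section.
        response["stats"] = compute_stats(definition)
    return response
```

With this shape, a Terraform-style client calling without the flag would get back exactly what it created, while the frontend could still fetch everything in one request.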

Does the data make sense? Are we missing something else?

@vinaychandrasekhar What do you think about having burn rate stats with the "Google Recommended" breakdowns? It would look something like this:

"burn_rate": {
   "5m":  1.5,
   "30m": 1,
   "1h": 2,
   "6h": 0.01,
   "3d": 0.001
}
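For reference, a burn rate is commonly defined as the error rate observed over a lookback window divided by the error budget (1 - target), so a value of 1 means the budget would be spent exactly at the end of the SLO period. A minimal sketch under that assumption follows; the function names and the per-window event counts are made up for illustration, and treating an empty window as a burn rate of 0 is a choice of the sketch.

```python
from typing import Dict, Tuple


def burn_rate(good: int, total: int, target: float) -> float:
    """Error rate over a window, expressed as a multiple of the error budget."""
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # assumption: a window without traffic burns no budget
    error_rate = (total - good) / total
    return error_rate / (1 - target)


def burn_rates(counts: Dict[str, Tuple[int, int]], target: float) -> Dict[str, float]:
    """Map each window label to its burn rate, given (good, total) counts per window."""
    return {
        window: round(burn_rate(good, total, target), 3)
        for window, (good, total) in counts.items()
    }


# Hypothetical counts for a 99% SLO: 1.5% errors over 5m, 1% errors over 1h.
example = burn_rates({"5m": (985, 1000), "1h": (9900, 10000)}, target=0.99)
```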

Can we find a better name for stats?

I'm OK with stats

Do we want to cap the error budget consumed to 100% and the error budget left to 0%? e.g. a 99% SLO with 0 good and 100 total events would give a consumption of 10,000% of the error budget (100 times the error budget) if not capped

I vote for capping it, BUT this does make me wonder if we shouldn't have a way to indicate "You're over your budget by X amount"?

@vinaychandrasekhar

vinaychandrasekhar commented Oct 5, 2022

Offering some default burn rates in the api response makes sense. If users have explicitly set up alerts for specific burn rates, including that set of burn rates would be good. For the defaults themselves (5m/30m/...) we may need to consider the defined SLO and offer the defaults accordingly. (For the Google example, the recommended params are for a 99.9% SLO. There's a separate section on other availability goals - see "Extreme Availability Goals" here). I wonder if we need to come up with a mapping of SLO range to default burn rates. Or, start simple and show either whatever the user has already chosen for their SLO, or offer two burn rates based on SLO and duration.

On capping, two considerations. First, readability: a consumption of 1000000% is not very readable or human usable (with or without comma separators). So a capping (with a clear indication that it's a capped number) at a sufficiently high number makes sense, as long as there's a way for the user to get to the actual number if they need it. In terms of presentation, perhaps even switch over to "X times over" or some such. Second, charting and trending: when charting consumed budget over time (over multiple SLO periods) we'll need to think through how capping plays a role.

@kdelemme
Contributor Author

kdelemme commented Oct 5, 2022

Thanks for the feedback @vinaychandrasekhar @simianhacker. I agree with you in general. I would just clarify that the recommended burn rate values relate to alerts, not to the SLO per se.

I wonder if we need to come up with a mapping of SLO range to default burn rates. Or, start simple and show either whatever the user has already chosen for their SLO, or offer two burn rates based on SLO and duration.

Agree, we should at least guide the user towards good values when defining Burn Rate alerts, and prohibit burn rate values that cannot be reached. 👇🏻 Some examples:
image

Also, regarding the burn rate alerts recommended values, I would suggest having them somewhere on the alerts API not the slo API. It's not useful when looking at the SLO, it's only useful when defining/editing a Burn Rate Alert on an SLO.

First, readability: a consumption of 1000000% is not very readable or human usable (with or without comma separators). So a capping (with a clear indication that it's a capped number) at a sufficiently high number makes sense, as long as there's a way for the user to get to the actual number if they need it. In terms of presentation, perhaps even switch over to "X times over" or some such.

Agree, 1000000% is not readable 🙈
From an API development and consumption perspective, it might be tricky to have two different behaviours based on the value of the fields. I would prefer delivering a consistent behaviour and letting the user decide what to do with it 👇🏻

What about not capping the error budget consumption, i.e. we return a number between 0 (0%) and +infinity (no upper bound on the percentage)? For example: 0.836 => 83.6%, 1.24 => 124% and 10 => 1000%.
And we let the frontend/user decide how to deal with large numbers as they see fit. For example, a number below 10 might be shown as a percentage, e.g. "you have used 124% of your error budget available", but a number above 10 is shown as "you consumed X times your error budget available".
This way we always return the value, and the user can decide what to do with it.
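As an illustration of what a consumer could do with the uncapped value, the sketch below switches wording around a threshold of 10. The helper name `describe_consumption` is hypothetical, and showing exactly 10 as a percentage is an arbitrary boundary choice, since the proposal only distinguishes "below" and "above".

```python
def describe_consumption(consumed: float, threshold: float = 10.0) -> str:
    """Turn an uncapped consumption ratio (1.0 == 100%) into a readable message."""
    if consumed < 0:
        raise ValueError("consumed cannot be negative")
    if consumed <= threshold:
        # Small ratios read naturally as a percentage of the budget.
        return f"You have used {consumed * 100:.1f}% of your error budget"
    # Large ratios are easier to read as a multiple of the budget.
    return f"You consumed {consumed:.1f} times your error budget"
```

The API stays consistent (always a plain ratio), and the presentation rule lives entirely on the consumer side.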

But we would cap the error budget remaining to a number between 0 and 1, meaning 0% left and 100% left respectively. Allowing a negative number for the remaining budget doesn't really mean anything, e.g. "you have -999% left" 🤷🏻. You either have all or some budget left, or none.
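Put together, the proposal amounts to passing the consumed ratio through unchanged and clamping only the remaining part. A minimal sketch, with a hypothetical helper name and field names borrowed from the example response:

```python
from typing import Dict


def error_budget_fields(consumed: float) -> Dict[str, float]:
    """Keep `consumed` uncapped, but clamp `remaining` into the [0, 1] range."""
    remaining = min(1.0, max(0.0, 1.0 - consumed))
    return {"consumed": consumed, "remaining": remaining}
```

So a consumption of 1.24 would be reported as-is, with a remaining budget of 0 rather than -0.24.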

Second, charting and trending: when charting consumed budget over time (over multiple SLO periods) we'll need to think through how capping plays a role.

This is like a history of the SLI value and error budget over time. I would suggest having this behind a separate API, e.g. /slos/id/history?from&to, so we can graph the evolution of the SLI and error budget consumption over time.
I would say this is out of scope for this story. But nonetheless, I agree we'll need to figure out how capping works there.

4 participants