[Actionable Observability] Compute current service level indicator and error budget consumption #142521

kdelemme · 2022-10-03T19:47:20Z

📝 Summary

For a given SLO, we want to compute the following information:

Current SLI value, e.g. Good / Total events over SLO time window
SLO Error Budget, e.g. 5% error budget for SLO target of 95%
Error Budget Consumed, e.g. 10% consumed of the error budget
Error Budget Left, e.g. 90% left on the error budget

With this data, a user can tell whether an SLO is currently met and what percentage of its error budget has been consumed, e.g. "50% of your error budget has been consumed".
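A minimal sketch of how these four values relate, assuming an occurrences-based budget where `good` and `total` are event counts over the SLO time window. The names `computeSummary`, `SloSummary` and `ErrorBudget` are hypothetical and not taken from the Kibana code base; treating an empty window as an SLI of 1 is also an assumption.

```typescript
// Hypothetical shapes mirroring the "summary" section of the example response.
interface ErrorBudget {
  initial: number; // 1 - target, e.g. 0.05 for a 95% target
  consumed: number; // fraction of the initial budget used (uncapped in this sketch)
  remaining: number; // 1 - consumed (uncapped in this sketch)
}

interface SloSummary {
  sli_value: number;
  error_budget: ErrorBudget;
}

function computeSummary(good: number, total: number, target: number): SloSummary {
  if (!(target > 0 && target < 1)) {
    throw new RangeError("target must be strictly between 0 and 1");
  }
  if (good < 0 || total < 0 || good > total) {
    throw new RangeError("expected 0 <= good <= total");
  }
  // Assumption: with no events in the window, nothing went wrong, so SLI = 1.
  const sli = total === 0 ? 1 : good / total;
  const initial = 1 - target;
  // Share of the allowed "bad" fraction that the observed "bad" fraction represents.
  const consumed = (1 - sli) / initial;
  return {
    sli_value: sli,
    error_budget: { initial, consumed, remaining: 1 - consumed },
  };
}

// 9,990 good events out of 10,000 against a 95% target:
// SLI = 0.999, budget = 0.05, consumed is about 0.02, remaining is about 0.98.
const exampleSummary = computeSummary(9990, 10000, 0.95);
```

Results are subject to floating-point rounding, so a consumer would typically round them before display.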

Example (stats) of the GET /slos/id

{
	"id": "fb7bea40-43e4-11ed-aa34-af16f78c8a81",
	"name": "My SLO Availability",
	"description": "99% o11y-app all services availablility",
	"indicator": {
		"type": "slo.apm.transaction_error_rate",
		"params": {
			"environment": "development",
			"service": "o11y-app",
			"transaction_type": "request",
			"transaction_name": "GET /flaky",
			"good_status_codes": [
				"2xx",
				"3xx",
				"4xx"
			]
		}
	},
	"time_window": {
		"duration": "7d",
		"is_rolling": true
	},
	"budgeting_method": "occurrences",
	"objective": {
		"target": 0.95
	},
	"summary": {
		"sli_value": 0.999227,
		"error_budget": {
			"initial": 0.05,
			"consumed": 0.015452,
			"remaining": 0.984548
		}
	},
	"revision": 1,
	"created_at": "2022-10-04T13:03:46.916Z",
	"updated_at": "2022-10-04T13:03:46.916Z"
}

❓Questions

Should we include this information on the GET /slos/{id} route or create a sub route? I tend to prefer the former approach to avoid a second request from the frontend.
Does the data make sense? Are we missing something else?
Can we find a better name for summary?
Do we want to cap the error budget consumed to 100% and the error budget left to 0%? e.g. a 99% SLO with 0 good and 100 total events would give a consumption of 10,000% of the error budget (100 times the error budget) if not capped

✅ Acceptance Criteria

Fetching an SLO returns the current SLI value for the SLO time window
Fetching an SLO returns the current error budget consumption details (initial error budget relative to the SLO target, consumed and left relative to the initial error budget)


elasticmachine · 2022-10-03T19:47:43Z

Pinging @elastic/actionable-observability (Team: Actionable Observability)

simianhacker · 2022-10-05T19:38:40Z

Should we include this information on the GET /slos/{id} route or create a sub route? I tend to prefer the former approach to avoid a second request from the frontend.

I would prefer the former as well, for the same reason (from a frontend developer's perspective). My only concern is using this with Ansible/Terraform: is the expectation that GET /api/slo/{id} returns only the definition created through the create API, or is it OK to also include some stats? We could always add an optional URL argument, like GET /api/slo/{id}?withStats, which would include a stats section.
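A rough sketch of the optional-argument idea, assuming the query string has already been parsed into a plain object. `buildSloResponse`, `SloDefinition`, `SloStats` and `Query` are hypothetical names for illustration only; the real route handler and schema may look quite different.

```typescript
// Hypothetical, trimmed-down shapes for illustration.
interface SloDefinition {
  id: string;
  name: string;
  objective: { target: number };
}

interface SloStats {
  sli_value: number;
}

// Parsed query string, e.g. "?withStats" becomes { withStats: "" }.
type Query = { [key: string]: string | undefined };

function buildSloResponse(
  definition: SloDefinition,
  stats: SloStats,
  query: Query
): SloDefinition & { stats?: SloStats } {
  // The flag carries no value, so only its presence is checked.
  const includeStats = "withStats" in query;
  // Without the flag, the response matches what the create API accepted,
  // which keeps round-tripping through tools like Terraform predictable.
  return includeStats ? { ...definition, stats } : { ...definition };
}
```

The trade-off is that the frontend still makes a single request, while declarative tooling can omit the flag and get back only the definition.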

Does the data make sense? Are we missing something else?

@vinaychandrasekhar What do you think about including burn rate stats with the "Google Recommended" breakdowns? It would look something like this:

"burn_rate": {
   "5m":  1.5,
   "30m": 1,
   "1h": 2,
   "6h": 0.01,
   "3d": 0.001
}
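For reference, a sketch of how such a `burn_rate` map could be derived, assuming the common definition of burn rate as the error rate observed over a lookback window divided by the error budget (1 - target). The function names and the input shape are hypothetical, and returning 0 for a window without traffic is an assumption.

```typescript
// Hypothetical input: event counts per lookback window.
interface WindowCounts {
  window: string; // e.g. "5m", "1h"
  good: number;
  total: number;
}

// Burn rate 1 means the budget is consumed exactly by the end of the SLO period;
// 2 means it would be exhausted in half that time.
function burnRate(good: number, total: number, target: number): number {
  if (!(target > 0 && target < 1)) {
    throw new RangeError("target must be strictly between 0 and 1");
  }
  if (total === 0) {
    return 0; // Assumption: no traffic burns no budget.
  }
  const errorRate = 1 - good / total;
  return errorRate / (1 - target);
}

function burnRates(windows: WindowCounts[], target: number): { [window: string]: number } {
  const result: { [window: string]: number } = {};
  for (const w of windows) {
    result[w.window] = burnRate(w.good, w.total, target);
  }
  return result;
}
```

For example, 98 good events out of 100 in the last hour against a 99% target gives an error rate of 2% over a budget of 1%, so a burn rate of about 2.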

Can we find a better name for stats?

I'm OK with stats

Do we want to cap the error budget consumed to 100% and the error budget left to 0%? e.g. a 99% SLO with 0 good and 100 total events would give a consumption of 10,000% of the error budget (100 times the error budget) if not capped

I vote for capping it, BUT this makes me wonder whether we should have a way to indicate "You're over your budget by X amount".

vinaychandrasekhar · 2022-10-05T19:57:55Z

Offering some default burn rates in the api response makes sense. If users have explicitly set up alerts for specific burn rates, including that set of burn rates would be good. For the defaults themselves (5m/30m/...) we may need to consider the defined SLO and offer the defaults accordingly. (For the Google example, the recommended params are for a 99.9% SLO. There's a separate section on other availability goals - see "Extreme Availability Goals" here). I wonder if we need to come up with a mapping of SLO range to default burn rates. Or, start simple and show either whatever the user has already chosen for their SLO, or offer two burn rates based on SLO and duration.

On capping, two considerations. First, readability: a consumption of 1000000% is not very readable or human usable (with or without comma separators). So a capping (with a clear indication that it's a capped number) at a sufficiently high number makes sense, as long as there's a way for the user to get to the actual number if they need it. In terms of presentation, perhaps even switch over to "X times over" or some such. Second, charting and trending: when charting consumed budget over time (over multiple SLO periods) we'll need to think through how capping plays a role.

kdelemme · 2022-10-05T21:11:04Z

Thanks for the feedback @vinaychandrasekhar @simianhacker. I agree with you in general; I would just clarify that the recommended burn rate values relate to alerts, not to the SLO per se.

I wonder if we need to come up with a mapping of SLO range to default burn rates. Or, start simple and show either whatever the user has already chosen for their SLO, or offer two burn rates based on SLO and duration.

Agree. We should at least guide the user toward good values when defining Burn Rate alerts, and prohibit a burn rate value that cannot be reached. 👇🏻 Some examples:

Also, regarding the recommended values for burn rate alerts, I would suggest exposing them on the alerts API, not the SLO API. They are not useful when looking at the SLO, only when defining or editing a Burn Rate alert on an SLO.
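To illustrate what "a burn rate value that cannot be reached" means: assuming burn rate is the error rate divided by the error budget, the highest possible burn rate occurs at a 100% error rate and equals 1 / (1 - target). The sketch below uses hypothetical function names and a small tolerance to absorb floating-point rounding; it is not an existing validation in the alerts API.

```typescript
// Tolerance for floating-point rounding, e.g. 1 / (1 - 0.95) is slightly below 20.
const EPSILON = 1e-9;

// Burn rate when every single event is bad (error rate = 1).
function maxBurnRate(target: number): number {
  if (!(target > 0 && target < 1)) {
    throw new RangeError("target must be strictly between 0 and 1");
  }
  return 1 / (1 - target);
}

// A threshold above the maximum can never fire, so an alert form could reject it.
function isReachableBurnRate(threshold: number, target: number): boolean {
  return threshold > 0 && threshold <= maxBurnRate(target) * (1 + EPSILON);
}
```

With a 99.9% target the maximum is about 1000, so a threshold of 14.4 is reachable. With a 90% target the maximum is about 10, so the same threshold would never trigger.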

First, readability: a consumption of 1000000% is not very readable or human usable (with or without comma separators). So a capping (with a clear indication that it's a capped number) at a sufficiently high number makes sense, as long as there's a way for the user to get to the actual number if they need it. In terms of presentation, perhaps even switch over to "X times over" or some such.

Agree, 1000000% is not readable 🙈
From an API development and consumption perspective, it might be tricky to have two different behaviours depending on the value of the fields. I would prefer delivering a consistent behaviour and letting the user decide what to do with it 👇🏻

What about not capping the error budget consumption, i.e. we return a number between 0 (0%) and +infinity? For example: 0.836 => 83.6%, 1.24 => 124% and 10 => 1000%.
And we let the frontend/user decide how to deal with large numbers as they see fit. For example, a number below 10 might be shown as a percentage, e.g. "you have used 124% of your error budget available", while a number above 10 is shown as "you consumed X times your error budget available".
This way we always return the value, and the user can decide what to do with it.
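A possible client-side formatter for this, as a sketch only. The threshold of 10 comes from the example above; treating exactly 10 as "X times", the one-decimal rounding, and the name `describeConsumption` are assumptions.

```typescript
// Below this ratio, show a percentage; at or above it, show "X times".
const TIMES_THRESHOLD = 10;

function describeConsumption(consumed: number): string {
  if (!isFinite(consumed) || consumed < 0) {
    throw new RangeError("consumed must be a finite, non-negative number");
  }
  if (consumed < TIMES_THRESHOLD) {
    const percent = (consumed * 100).toFixed(1);
    return `you have used ${percent}% of your error budget`;
  }
  return `you consumed ${consumed.toFixed(1)} times your error budget`;
}

const underBudget = describeConsumption(0.836); // "you have used 83.6% of your error budget"
const overBudget = describeConsumption(1.24); // "you have used 124.0% of your error budget"
const wayOver = describeConsumption(100); // "you consumed 100.0 times your error budget"
```

Because the API always returns the raw ratio, each consumer can pick its own threshold and wording without a second code path on the server.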

But we would cap the error budget remaining to a number between 0 and 1, meaning 0% left and 100% left respectively. A negative remaining budget doesn't really mean anything, e.g. "you have -999% left" 🤷🏻. You either have all or some budget left, or none.
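Put together, the proposal (uncapped consumption, remaining budget clamped to [0, 1]) could look like the sketch below. `errorBudgetFromSli` and `ErrorBudgetView` are hypothetical names, not the merged implementation.

```typescript
interface ErrorBudgetView {
  initial: number; // 1 - target
  consumed: number; // 0 .. +infinity, never capped
  remaining: number; // clamped to 0 .. 1
}

function errorBudgetFromSli(sli: number, target: number): ErrorBudgetView {
  if (!(target > 0 && target < 1)) {
    throw new RangeError("target must be strictly between 0 and 1");
  }
  if (!(sli >= 0 && sli <= 1)) {
    throw new RangeError("sli must be between 0 and 1");
  }
  const initial = 1 - target;
  const consumed = (1 - sli) / initial;
  // Once the budget is blown, report "nothing left" instead of a negative value.
  const remaining = Math.min(1, Math.max(0, 1 - consumed));
  return { initial, consumed, remaining };
}

// The example from the issue: a 99% SLO with 0 good events out of 100.
// consumed is about 100 (i.e. 10,000% of the budget) and remaining is 0.
const blownBudget = errorBudgetFromSli(0 / 100, 0.99);
```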

Second, charting and trending: when charting consumed budget over time (over multiple SLO periods) we'll need to think through how capping plays a role.

This is like a history of the SLI value and error budget over time. I would suggest putting this behind a separate API, e.g. /slos/id/history?from&to, so we can graph the evolution of the SLI and error budget consumption over time.
I would say this is out of scope for this story. Nonetheless, I agree we'll need to figure out how capping works there.

kdelemme self-assigned this Oct 3, 2022

kdelemme added the v8.6.0 label Oct 3, 2022

botelastic bot added the needs-team Issues missing a team label label Oct 3, 2022

kdelemme added Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" and removed needs-team Issues missing a team label labels Oct 3, 2022

kdelemme changed the title ~~[Actionable Observability] Compute SLO current stats~~ [Actionable Observability] Compute current service level indicator and error budget consumption Oct 4, 2022

This was referenced Oct 5, 2022

chore(slo): Compute SLI value and error budget consumption #142784

Merged

SLOs API - Phase 1 #137323

Closed

kdelemme closed this as completed in #142784 Oct 12, 2022
