[Response Ops][Task Manager] Expose SLI metrics in HTTP API #162178

ymao1 · 2023-07-18T17:18:32Z

Summary

Exposes a new HTTP API at /api/task_manager/metrics that collects SLI metrics for task manager. The following metrics are exposed:

count of task claim successes & count of task claim tries - this is a counter metric that keeps track over overall task claim success, not task claim success of individual background task workers
count of task run success & count of task runs - this is a counter metric that keeps track of overall task run successes, as well as successes grouped by task type. Alerting and action task types are rolled up into an alerting and an actions group to allow us to calculate SLIs across all alerting rules and all actions
task claim duration in milliseconds - this is a histogram counter metric that is bucketed into 100 ms buckets

These counter metrics are incremented until a reset event is received, in which case the counter is reset back to 0. This allows the collection mechanism (in this case Elastic Agent) to determine the interval at which these metrics are collected as well as to collect the rate of change for these SLI metrics without having to perform complicated Elasticsearch aggregation math. In addition, the counters are reset every 30 seconds (this is configurable) to avoid providing the metrics collector with stale data in case of a collector outage.

Flaky test runner: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2813

ymao1 · 2023-07-19T15:04:10Z

@elasticmachine merge upstream

…/slis

ymao1 · 2023-07-31T15:02:23Z

x-pack/plugins/task_manager/server/config.ts

  max_attempts: schema.number({ defaultValue: 100, min: 1, max: 500 }),
 });

 export const configSchema = schema.object(
  {
+    allow_reading_invalid_state: schema.boolean({ defaultValue: true }),


these changes are just alphabetized the config keys

ymao1 · 2023-08-07T17:50:09Z

x-pack/plugins/task_manager/server/task_running/task_runner.ts

@@ -321,9 +322,9 @@ export class TaskManagerRunner implements TaskRunner {

      let taskParamsValidation;
      if (this.requeueInvalidTasksConfig.enabled) {
-        taskParamsValidation = this.validateTaskParams();
+        taskParamsValidation = this.validateTaskParams(modifiedContext);


When adding functional tests to the task manager test suite, I noticed that the beforeSave and beforeRun middleware could potentially modify the task params which could potentially invalidate the shape of the params before this params validation occurs.

Specifically in x-pack/test/plugin_api_integration/plugins/sample_task_plugin/server/plugin.ts, the beforeSave middleware changes the task params shape while the beforeRun middleware restores them. In the task runner, we're validating on the task params as read from disk instead of after using beforeRun, so this change is to update that. I don't think it makes a difference here because we're not adding any task manager middleware but not sure what the right approach is here.

Did we ever find anyone using the middleware? I'm +1 on deleting it :-)

Discussed with @mikecote and we'll keep this change for now and have opened an issue to discuss removing the middleware hooks: #163401

ymao1 · 2023-08-07T17:50:37Z

x-pack/plugins/task_manager/server/task_running/task_runner.ts

-            taskTiming
-          )
-        );
+      async ({ runAt, schedule, hasError }: SuccessfulRunResult) => {


emitting the correct task run event for alerting rules.

…/slis

…-ref HEAD~1..HEAD --fix'

elasticmachine · 2023-08-08T02:02:51Z

Pinging @elastic/response-ops (Team:ResponseOps)

ymao1 · 2023-08-08T11:59:35Z

@elasticmachine merge upstream

ersin-erdal

LGTM
Tested locally, works as expected.

ymao1 · 2023-08-10T11:54:28Z

@elasticmachine merge upstream

kibana-ci · 2023-08-10T12:27:08Z

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id	before	after	diff
`taskManager`	24	25	+1

Total ESLint disabled count

id	before	after	diff
`taskManager`	24	25	+1

History

💚 Build #148284 succeeded 3da34a7
💚 Build #148117 succeeded 29cf9c4
💚 Build #148079 succeeded 48679d3
💚 Build #146589 succeeded 8a5c4f2
💔 Build #146422 failed 0d733ff

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

* main: (108 commits) [Telemetry Schema Validation] Allow `null` on `string` (elastic#163499) [Search] Add Slack and Gmail connectors (elastic#163321) [ML] Provide hints for empty fields in dropdown options in Anomaly detection & Transform creation wizards, Change point detection view (elastic#163371) chore(slo): Add response required fields (elastic#163430) [AO] Fix add_to_case functional test (elastic#163155) unskip license type functional test (elastic#163199) fix(NA): yarn env vars for node_modules mirrors (elastic#163549) [Response Ops][Task Manager] Expose SLI metrics in HTTP API (elastic#162178) [Logs UI] Adapt test to ES highlighting changes and unskip (elastic#163592) [Infra UI] Implement Telemetry on 'Show' buttons within Inventory (elastic#163587) [Enterprise Search]Migrate all usages of EuiPage*_Deprecated (elastic#163482) fix(slo): settings and access for serverless (elastic#163514) [Infra UI] Implement telemetry for the asset details flyout (elastic#163078) [Fleet] Add a banner to the top of the Kafka Output UI to say that Elastic Defend integration is not supported (elastic#163579) [Fleet] Re-enable and fix Fleet policy secret integration tests (elastic#163428) [Fleet] add managed to imported saved object (elastic#163526) [Index Management] Disable index actions using contextRef (elastic#163475) [Discover] Inline shard failures warnings (elastic#161271) [Security Solution][Detection engine] skips geo_point non-ecs validation (elastic#163487) Update EUI layout components in bfetch example plugin (elastic#163490) ...

…lastic#162178)" This reverts commit 582d97d.

#163639) …#162178)" This reverts commit 582d97d. ## Summary Summarize your PR. If it involves visual changes include a screenshot or gif. ### Checklist Delete any items that are not applicable to this PR. - [ ] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] Any UI touched in this PR is usable by keyboard only (learn more about [keyboard accessibility](https://webaim.org/techniques/keyboard/)) - [ ] Any UI touched in this PR does not create any new axe failures (run axe in browser: [FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/), [Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US)) - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [ ] This renders correctly on smaller devices using a responsive layout. (You can test this [in your browser](https://www.browserstack.com/guide/responsive-testing-on-local-server)) - [ ] This was checked for [cross-browser compatibility](https://www.elastic.co/support/matrix#matrix_browsers) ### Risk Matrix Delete this section if it is not applicable to this PR. Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release. When forming the risk matrix, consider some of the following examples and how they may potentially impact the change: | Risk | Probability | Severity | Mitigation/Notes | |---------------------------|-------------|----------|-------------------------| | Multiple Spaces—unexpected behavior in non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces. | | Multiple nodes—Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure. | | Code should gracefully handle cases when feature X or plugin Y are disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still results in our service operational. | | [See more potential risk examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) | ### For maintainers - [ ] This was checked for breaking API changes and was [labeled appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

…163652) ## Summary Redo of this PR #162178 but without the `native-hdr-histogram` library which caused issues in the serverless build. In the future we may want to pursue generating a custom build of this native library but for our current purposes, a simple bucketed histogram should suffice. The only changes from the original PR are in this commit: dde5245, where we create a `SimpleHistogram` class to bucket task claim durations into `100ms` buckets. Please reference the original PR for more description about this HTTP API --------- Co-authored-by: Kibana Machine <[email protected]>

ymao1 added 3 commits July 18, 2023 13:18

Adding counter for task polling success rate

2fd6d11

Fixing types

d92ee65

Resetting counter on interval and on route access

dd63d3f

ymao1 added the ci:cloud-deploy Create or update a Cloud deployment label Jul 19, 2023

kibanamachine and others added 13 commits July 19, 2023 11:04

Merge branch 'main' into alerting/slis

6497c0a

Adding task run metric

07b13da

Merge branch 'main' of github.com:elastic/kibana into alerting/slis

808bf7f

Merge branch 'alerting/slis' of github.com:ymao1/kibana into alerting…

ce03851

…/slis

Grouping alerting and action task type run results

7a7606f

Merge branch 'main' of github.com:elastic/kibana into alerting/slis

e71e796

Hack to force an error

3c515cf

Emitting correct event for alerting task failure

0216353

Default query param to true

99f89d8

Merge branch 'main' of github.com:elastic/kibana into alerting/slis

a7ab380

Merge branch 'main' of github.com:elastic/kibana into alerting/slis

72c70a4

Refactoring metrics streams

5e598b6

Merge branch 'main' of github.com:elastic/kibana into alerting/slis

8fa09ad

ymao1 commented Jul 31, 2023

View reviewed changes

ymao1 added 6 commits July 31, 2023 11:12

Cleanup

d3fc813

Adding hdr histogram for claim duration

81b899e

Merge branch 'main' of github.com:elastic/kibana into alerting/slis

0bbcf65

Merge branch 'main' of github.com:elastic/kibana into alerting/slis

0d733ff

Bucketing durations

b830920

Merge branch 'main' of github.com:elastic/kibana into alerting/slis

8a5c4f2

ymao1 changed the title ~~Adding counter for task polling success rate~~ [Response Ops][Task Manager] Expose SLI metrics in HTTP API Aug 1, 2023

ymao1 added 4 commits August 4, 2023 08:19

Merge branch 'main' of github.com:elastic/kibana into alerting/slis

a3cce3b

adding functional test

6a90edc

Merge branch 'main' of github.com:elastic/kibana into alerting/slis

7eed409

tests

2dcadca

ymao1 added Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.10.0 labels Aug 7, 2023

ymao1 commented Aug 7, 2023

View reviewed changes

ymao1 and others added 4 commits August 7, 2023 13:51

cleanup

2353945

Merge branch 'alerting/slis' of github.com:ymao1/kibana into alerting…

48679d3

…/slis

fixing test

a7b25fc

[CI] Auto-commit changed files from 'node scripts/precommit_hook.js -…

29cf9c4

…-ref HEAD~1..HEAD --fix'

ymao1 marked this pull request as ready for review August 8, 2023 02:02

ymao1 requested a review from a team as a code owner August 8, 2023 02:02

Merge branch 'main' into alerting/slis

3da34a7

ersin-erdal approved these changes Aug 10, 2023

View reviewed changes

Merge branch 'main' into alerting/slis

d8192b8

ymao1 mentioned this pull request Aug 10, 2023

SLIs - Phase 1 #160334

Closed

ymao1 merged commit 582d97d into elastic:main Aug 10, 2023

ymao1 deleted the alerting/slis branch August 10, 2023 14:23

kibanamachine added the backport:skip This commit does not require backporting label Aug 10, 2023

ymao1 added a commit to ymao1/kibana that referenced this pull request Aug 10, 2023

Revert "[Response Ops][Task Manager] Expose SLI metrics in HTTP API (e…

4e4274b

…lastic#162178)" This reverts commit 582d97d.

ymao1 mentioned this pull request Aug 10, 2023

Revert "[Response Ops][Task Manager] Expose SLI metrics in HTTP API (… #163639

Merged

9 tasks

ymao1 mentioned this pull request Aug 11, 2023

[Response Ops][Task Manager] Expose SLI metrics in HTTP API - Take 2 #163652

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Response Ops][Task Manager] Expose SLI metrics in HTTP API #162178

[Response Ops][Task Manager] Expose SLI metrics in HTTP API #162178

ymao1 commented Jul 18, 2023 •

edited

Loading

ymao1 commented Jul 19, 2023

ymao1 Jul 31, 2023

ymao1 Aug 7, 2023

pmuellr Aug 8, 2023

ymao1 Aug 8, 2023

ymao1 Aug 7, 2023

elasticmachine commented Aug 8, 2023

ymao1 commented Aug 8, 2023

ersin-erdal left a comment

ymao1 commented Aug 10, 2023

kibana-ci commented Aug 10, 2023 •

edited

Loading

ESLint disabled line counts

Total ESLint disabled count

[Response Ops][Task Manager] Expose SLI metrics in HTTP API #162178

[Response Ops][Task Manager] Expose SLI metrics in HTTP API #162178

Conversation

ymao1 commented Jul 18, 2023 • edited Loading

Summary

ymao1 commented Jul 19, 2023

ymao1 Jul 31, 2023

Choose a reason for hiding this comment

ymao1 Aug 7, 2023

Choose a reason for hiding this comment

pmuellr Aug 8, 2023

Choose a reason for hiding this comment

ymao1 Aug 8, 2023

Choose a reason for hiding this comment

ymao1 Aug 7, 2023

Choose a reason for hiding this comment

elasticmachine commented Aug 8, 2023

ymao1 commented Aug 8, 2023

ersin-erdal left a comment

Choose a reason for hiding this comment

ymao1 commented Aug 10, 2023

kibana-ci commented Aug 10, 2023 • edited Loading

💚 Build Succeeded

Metrics [docs]

ESLint disabled line counts

Total ESLint disabled count

History

ymao1 commented Jul 18, 2023 •

edited

Loading

kibana-ci commented Aug 10, 2023 •

edited

Loading