[Response Ops][Task Manager] Resource based task scheduling - 2nd attempt #189626

ymao1 · 2024-07-31T12:25:30Z

Summary

Redoing the resource based task claim PR: #187999 and followup PRs #189220 and #189117. Please see the descriptions of those PRs for more details.

This was originally reverted because unregistered task types in serverless caused the task manager health aggregation to fail. This PR includes an additional commit to exclude unregistered task types from the health report: 58eb2b1.
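
A minimal sketch of the exclusion described above, not the code from 58eb2b1. It assumes the workload aggregation yields one bucket per task type; the `TaskTypeBucket` shape and the `summarizeRegisteredTaskTypes` helper are hypothetical names used only for illustration.

```typescript
// Hypothetical shape of one per-task-type aggregation bucket.
interface TaskTypeBucket {
  taskType: string;
  count: number;
}

// Build a per-type count summary, skipping any task type that is not
// currently registered (for example, because its plugin is disabled).
function summarizeRegisteredTaskTypes(
  buckets: TaskTypeBucket[],
  registeredTypes: Set<string>
): Record<string, number> {
  const summary: Record<string, number> = {};
  for (const { taskType, count } of buckets) {
    if (!registeredTypes.has(taskType)) {
      continue; // unregistered: leave it out of the report instead of failing
    }
    summary[taskType] = (summary[taskType] ?? 0) + count;
  }
  return summary;
}
```

With this shape, a task document left behind by a disabled plugin no longer reaches the code that looks up its task definition.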

To verify this, make sure you're using the default claim strategy and start up Kibana so that the default set of tasks gets created. Then either disable a bunch of plugins via config:

# remove security and o11y
enterpriseSearch.enabled: false
xpack.apm.enabled: false
xpack.cloudSecurityPosture.enabled: false
xpack.fleet.enabled: false
xpack.infra.enabled: false
xpack.observability.enabled: false
xpack.observabilityAIAssistant.enabled: false
xpack.observabilityLogsExplorer.enabled: false
xpack.search.notebooks.enabled: false
xpack.securitySolution.enabled: false
xpack.uptime.enabled: false

or comment out the task registration of a task that was previously scheduled (I'm using the observability AI assistant):

--- a/x-pack/plugins/observability_solution/observability_ai_assistant/server/service/index.ts
+++ b/x-pack/plugins/observability_solution/observability_ai_assistant/server/service/index.ts
@@ -89,24 +89,24 @@ export class ObservabilityAIAssistantService {

     this.allowInit();

-    taskManager.registerTaskDefinitions({
-      [INDEX_QUEUED_DOCUMENTS_TASK_TYPE]: {
-        title: 'Index queued KB articles',
-        description:
-          'Indexes previously registered entries into the knowledge base when it is ready',
-        timeout: '30m',
-        maxAttempts: 2,
-        createTaskRunner: (context) => {
-          return {
-            run: async () => {
-              if (this.kbService) {
-                await this.kbService.processQueue();
-              }
-            },
-          };
-        },
-      },
-    });
+    // taskManager.registerTaskDefinitions({
+    //   [INDEX_QUEUED_DOCUMENTS_TASK_TYPE]: {
+    //     title: 'Index queued KB articles',
+    //     description:
+    //       'Indexes previously registered entries into the knowledge base when it is ready',
+    //     timeout: '30m',
+    //     maxAttempts: 2,
+    //     createTaskRunner: (context) => {
+    //       return {
+    //         run: async () => {
+    //           if (this.kbService) {
+    //             await this.kbService.processQueue();
+    //           }
+    //         },
+    //       };
+    //     },
+    //   },
+    // });
   }

and restart Kibana. You should still be able to access the TM health report with the workload field. If you update the background health logging so that it always logs, and logs more frequently, you should see the logging succeed with no errors.

Below, I've made changes to always log the background health at a 15-second interval:

--- a/x-pack/plugins/task_manager/server/plugin.ts
+++ b/x-pack/plugins/task_manager/server/plugin.ts
@@ -236,6 +236,7 @@ export class TaskManagerPlugin
     if (this.isNodeBackgroundTasksOnly()) {
       setupIntervalLogging(monitoredHealth$, this.logger, LogHealthForBackgroundTasksOnlyMinutes);
     }
+    setupIntervalLogging(monitoredHealth$, this.logger, LogHealthForBackgroundTasksOnlyMinutes);

Reduce the logging interval:

--- a/x-pack/plugins/task_manager/server/lib/log_health_metrics.ts
+++ b/x-pack/plugins/task_manager/server/lib/log_health_metrics.ts
@@ -35,7 +35,8 @@ export function setupIntervalLogging(
     monitoredHealth = m;
   });

-  setInterval(onInterval, 1000 * 60 * minutes);
+  // setInterval(onInterval, 1000 * 60 * minutes);
+  setInterval(onInterval, 1000 * 15);

   function onInterval() {

elasticmachine · 2024-07-31T15:04:38Z

Pinging @elastic/response-ops (Team:ResponseOps)

afharo

kibana.jsonc LGTM

afharo · 2024-07-31T17:54:07Z

x-pack/plugins/task_manager/server/plugin.ts

+      claimStrategy: this.config?.claim_strategy,
+      heapSizeLimit: this.heapSizeLimit,
+      isCloud: cloud?.isCloudEnabled ?? false,
+      isServerless: !!serverless,


nit: you may want to check this.initContext.env.packageInfo.buildFlavor === 'serverless' and save yourself from one additional plugin dependency 😇

Oh good tip! Thanks!

Updated in 61f587b
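
The suggestion above can be sketched roughly as follows. The `PluginInitContextLike` interface is a hypothetical, trimmed-down stand-in for the plugin initializer context; only the `env.packageInfo.buildFlavor` path and the `'serverless'` value come from the comment.

```typescript
// Hypothetical minimal stand-in for the plugin initializer context.
interface PluginInitContextLike {
  env: { packageInfo: { buildFlavor: string } };
}

// Detect serverless from the build flavor instead of depending on the
// serverless plugin being present.
function isServerlessBuild(initContext: PluginInitContextLike): boolean {
  return initContext.env.packageInfo.buildFlavor === 'serverless';
}
```

The result could then replace `!!serverless` when passing `isServerless` in the diff above.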

ymao1 · 2024-08-02T13:14:44Z

@elasticmachine merge upstream

ymao1 · 2024-08-03T02:24:23Z

@pmuellr I pushed some changes to accommodate the updates from your recent PR f180148

pmuellr · 2024-08-05T19:14:11Z

@elasticmachine merge upstream

mikecote

Changes LGTM! Looked at the code diff and the additional commits, ran some local tests using both claimers, etc.

pmuellr · 2024-08-06T21:36:19Z

@elasticmachine merge upstream

Noticed we hadn't run this on serverless, so building some cloud/serverless images to make sure they basically run ...

kibana-ci · 2024-08-06T22:31:16Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: 85690ec
Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-189626-85690ecfc479

Failed CI Steps

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id	before	after	diff
`taskManager`	62	63	+1

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id	before	after	diff
`taskManager`	5	7	+2

Unknown metric groups

API count

id	before	after	diff
`taskManager`	105	107	+2

History

💚 Build #225935 succeeded b9d8888
💚 Build #225712 succeeded 8d6ac21
💔 Build #225696 failed a81787a
💚 Build #225583 succeeded e09ae50
💛 Build #225150 was flaky e761d81

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

pmuellr

code LGTM; ran this locally w/mget, seems to work fine. Ran in cloud ESS and serverless (default claimer), and everything looks fine.

pmuellr · 2024-08-02T15:26:31Z

x-pack/plugins/task_manager/server/config.ts

@@ -64,6 +67,8 @@ const requestTimeoutsConfig = schema.object({
 export const configSchema = schema.object(
  {
    allow_reading_invalid_state: schema.boolean({ defaultValue: true }),
+    /* The number of normal cost tasks that this Kibana instance will run simultaneously */
+    capacity: schema.maybe(schema.number({ min: MIN_CAPACITY, max: MAX_CAPACITY })),


We want to add this to kibana-docker file, right? And to the cloud allow-list?

pmuellr · 2024-08-02T15:49:22Z

x-pack/plugins/task_manager/server/lib/get_default_capacity.ts

+  { minHeap: 0, maxHeap: 1, capacity: 10 },
+  { minHeap: 1, maxHeap: 2, capacity: 15 },
+  { minHeap: 2, maxHeap: 4, capacity: 25, backgroundTaskNodeOnly: false },
+  { minHeap: 2, maxHeap: 4, capacity: 50, backgroundTaskNodeOnly: true },


Given the constraints (cloud), presumably we'll not see anything greater than 4GB. Today. :-). But am wondering for a 4GB Kibana, what is the metrics.process.memory.heap.size_limit? Could it end up being just over 4GB?

Wonder if we should set the maxHeap value for the final 2 to a bigger number, I suspect 16 would cover us for a long time ... or I guess Infinity would be better?
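
A minimal sketch of what the `Infinity` suggestion could look like, not the actual contents of get_default_capacity.ts. The rows are taken from the diff above; treating the heap values as GB, treating `minHeap` as inclusive and `maxHeap` as exclusive, and the `lookupCapacity` helper with its `fallback` parameter are all assumptions made for illustration.

```typescript
interface HeapCapacityRow {
  minHeap: number; // GB, assumed inclusive lower bound
  maxHeap: number; // GB, assumed exclusive upper bound
  capacity: number;
  backgroundTaskNodeOnly?: boolean;
}

// Rows from the diff, with the last two upper bounds opened up to Infinity
// so a heap limit slightly above 4GB still matches.
const HEAP_TO_CAPACITY: HeapCapacityRow[] = [
  { minHeap: 0, maxHeap: 1, capacity: 10 },
  { minHeap: 1, maxHeap: 2, capacity: 15 },
  { minHeap: 2, maxHeap: Infinity, capacity: 25, backgroundTaskNodeOnly: false },
  { minHeap: 2, maxHeap: Infinity, capacity: 50, backgroundTaskNodeOnly: true },
];

// Hypothetical lookup: returns the capacity of the first matching row,
// or the caller-supplied fallback when nothing matches.
function lookupCapacity(
  heapSizeLimitBytes: number,
  isBackgroundTaskNodeOnly: boolean,
  fallback: number
): number {
  const heapGb = heapSizeLimitBytes / (1024 * 1024 * 1024);
  const row = HEAP_TO_CAPACITY.find(
    (r) =>
      heapGb >= r.minHeap &&
      heapGb < r.maxHeap &&
      (r.backgroundTaskNodeOnly === undefined ||
        r.backgroundTaskNodeOnly === isBackgroundTaskNodeOnly)
  );
  return row ? row.capacity : fallback;
}
```

With a bounded `maxHeap: 4`, a reported limit just over 4GB would match no row and fall through to the fallback, which is the case the comment above is worried about.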

pmuellr · 2024-08-06T02:22:43Z

x-pack/plugins/task_manager/server/task_pool/cost_capacity.ts

+      // Capacity config describes the number of normal-cost tasks that can be
+      // run simultaneously. Multiply by the cost of a normal-cost task to determine
+      // the maximum allowed cost
+      this.maxAllowedCost = getCapacityInCost(capacity);


is some code going to get complain-y if maxAllowedCost is updated late, so is 0 for a while?
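
To make the question concrete, here is a small sketch of the capacity-to-cost conversion and the window during which the maximum is still 0. It is not the code from cost_capacity.ts: `getCapacityInCost` is named in the diff but its body here is assumed, the `NORMAL_TASK_COST` value of 2 is an assumption, and `CostCapacitySketch` is a hypothetical class.

```typescript
// Assumed cost of one normal-cost task; the real value is not shown in this thread.
const NORMAL_TASK_COST = 2;

// Name taken from the diff above; the body is an assumption.
function getCapacityInCost(capacity: number): number {
  return capacity * NORMAL_TASK_COST;
}

// Hypothetical stand-in for the cost-based capacity tracker.
class CostCapacitySketch {
  // Stays at 0 until the first capacity value arrives.
  private maxAllowedCost = 0;
  private usedCost = 0;

  setCapacity(capacity: number): void {
    this.maxAllowedCost = getCapacityInCost(capacity);
  }

  // Record the cost of tasks that are currently running.
  setUsedCost(cost: number): void {
    this.usedCost = cost;
  }

  // Never negative, so callers see "no room" rather than an error
  // while maxAllowedCost is still 0.
  get availableCost(): number {
    return Math.max(0, this.maxAllowedCost - this.usedCost);
  }
}
```

In this sketch, a late capacity update only means nothing can be claimed until `setCapacity` is called; whether the real callers tolerate that is exactly what the comment above asks.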

pmuellr · 2024-08-07T19:17:39Z

I created a followup issue for the comments I made in this PR: #190095

ymao1 added 3 commits July 30, 2024 08:27

Not showing unrecognized task types in the health summary

58eb2b1

Revert "[main] Revert TM resource based task scheduling issues (elast…

55981f1

…ic#189529) (elastic#189554)" This reverts commit 7b38be0.

Merge branch 'tm-task-type-fix' into tm-resource-based-scheduling-att…

42a347a

…empt-2

ymao1 changed the title ~~Tm resource based scheduling attempt 2~~ [Response Ops][Task Manager] Resource based task scheduling - 2nd attempt Jul 31, 2024

ymao1 self-assigned this Jul 31, 2024

ymao1 added the release_note:skip, Feature:Task Manager, Team:ResponseOps, and v8.16.0 labels Jul 31, 2024

ymao1 marked this pull request as ready for review July 31, 2024 15:04

ymao1 requested review from a team as code owners July 31, 2024 15:04

ymao1 requested a review from pmuellr July 31, 2024 15:04

afharo approved these changes Jul 31, 2024

afharo reviewed Jul 31, 2024

ymao1 added the ci:project-deploy-elasticsearch label Jul 31, 2024

ymao1 and others added 2 commits July 31, 2024 15:13

Using buildFlavor to detect serverless

61f587b

[CI] Auto-commit changed files from 'node scripts/lint_ts_projects --…

e761d81

…fix'

ymao1 removed the ci:project-deploy-elasticsearch label Aug 1, 2024

elasticmachine and others added 4 commits August 2, 2024 23:14

Merge branch 'main' into tm-resource-based-scheduling-attempt-2

e09ae50

Merging in main

6718717

Changes to work with TaskTypeDictionary.get changes

f180148

Merge branch 'tm-resource-based-scheduling-attempt-2' of github.com:y…

a81787a

…mao1/kibana into tm-resource-based-scheduling-attempt-2

Fixing types

8d6ac21

Merge branch 'main' into tm-resource-based-scheduling-attempt-2

b9d8888

mikecote approved these changes Aug 6, 2024

pmuellr added ci:build-cloud-image ci:build-serverless-image labels Aug 6, 2024

Merge branch 'main' into tm-resource-based-scheduling-attempt-2

85690ec

pmuellr approved these changes Aug 7, 2024

pmuellr mentioned this pull request Aug 7, 2024

[ResponseOps][TaskManager] followups from resource based scheduling PR #190095

Closed

pmuellr merged commit e46e54a into elastic:main Aug 7, 2024
38 checks passed

kibanamachine added the backport:skip label Aug 7, 2024
