Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Response Ops][Task Manager] Resource based task scheduling - 2nd attempt #189626

Merged
merged 12 commits into from
Aug 7, 2024

Conversation

ymao1
Copy link
Contributor

@ymao1 ymao1 commented Jul 31, 2024

Summary

Redoing the resource based task claim PR: #187999 and followup PRs #189220 and #189117. Please see the descriptions of those PRs for more details.

This was original reverted because unregistered task types in serverless caused the task manager health aggregation to fail. This PR includes an additional commit to exclude unregistered task types from the health report: 58eb2b1.

To verify this, make sure you're using the default claim strategy, start up Kibana so that the default set of tasks get created. Then either disable a bunch of plugins via config:

# remove security and o11y
enterpriseSearch.enabled: false
xpack.apm.enabled: false
xpack.cloudSecurityPosture.enabled: false
xpack.fleet.enabled: false
xpack.infra.enabled: false
xpack.observability.enabled: false
xpack.observabilityAIAssistant.enabled: false
xpack.observabilityLogsExplorer.enabled: false
xpack.search.notebooks.enabled: false
xpack.securitySolution.enabled: false
xpack.uptime.enabled: false

or comment out the task registration of a task that was previously scheduled (I'm using the observability AI assistant)

--- a/x-pack/plugins/observability_solution/observability_ai_assistant/server/service/index.ts
+++ b/x-pack/plugins/observability_solution/observability_ai_assistant/server/service/index.ts
@@ -89,24 +89,24 @@ export class ObservabilityAIAssistantService {

     this.allowInit();

-    taskManager.registerTaskDefinitions({
-      [INDEX_QUEUED_DOCUMENTS_TASK_TYPE]: {
-        title: 'Index queued KB articles',
-        description:
-          'Indexes previously registered entries into the knowledge base when it is ready',
-        timeout: '30m',
-        maxAttempts: 2,
-        createTaskRunner: (context) => {
-          return {
-            run: async () => {
-              if (this.kbService) {
-                await this.kbService.processQueue();
-              }
-            },
-          };
-        },
-      },
-    });
+    // taskManager.registerTaskDefinitions({
+    //   [INDEX_QUEUED_DOCUMENTS_TASK_TYPE]: {
+    //     title: 'Index queued KB articles',
+    //     description:
+    //       'Indexes previously registered entries into the knowledge base when it is ready',
+    //     timeout: '30m',
+    //     maxAttempts: 2,
+    //     createTaskRunner: (context) => {
+    //       return {
+    //         run: async () => {
+    //           if (this.kbService) {
+    //             await this.kbService.processQueue();
+    //           }
+    //         },
+    //       };
+    //     },
+    //   },
+    // });
   }

and restart Kibana. You should still be able to access the TM health report with the workload field and if you update the background health logging so it always logs and more frequently, you should see the logging succeed with no errors:

Below, I've made changes to always log the background health at a 15 second interval:

--- a/x-pack/plugins/task_manager/server/plugin.ts
+++ b/x-pack/plugins/task_manager/server/plugin.ts
@@ -236,6 +236,7 @@ export class TaskManagerPlugin
     if (this.isNodeBackgroundTasksOnly()) {
       setupIntervalLogging(monitoredHealth$, this.logger, LogHealthForBackgroundTasksOnlyMinutes);
     }
+    setupIntervalLogging(monitoredHealth$, this.logger, LogHealthForBackgroundTasksOnlyMinutes);
reduce the logging interval


--- a/x-pack/plugins/task_manager/server/lib/log_health_metrics.ts
+++ b/x-pack/plugins/task_manager/server/lib/log_health_metrics.ts
@@ -35,7 +35,8 @@ export function setupIntervalLogging(
     monitoredHealth = m;
   });

-  setInterval(onInterval, 1000 * 60 * minutes);
+  // setInterval(onInterval, 1000 * 60 * minutes);
+  setInterval(onInterval, 1000 * 15);

   function onInterval() {

@ymao1 ymao1 changed the title Tm resource based scheduling attempt 2 [Response Ops][Task Manager] Resource based task scheduling - 2nd attempt Jul 31, 2024
@ymao1 ymao1 self-assigned this Jul 31, 2024
@ymao1 ymao1 added release_note:skip Skip the PR/issue when compiling release notes Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.16.0 labels Jul 31, 2024
@ymao1 ymao1 marked this pull request as ready for review July 31, 2024 15:04
@ymao1 ymao1 requested review from a team as code owners July 31, 2024 15:04
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@ymao1 ymao1 requested a review from pmuellr July 31, 2024 15:04
Copy link
Member

@afharo afharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kibana.jsonc LGTM

claimStrategy: this.config?.claim_strategy,
heapSizeLimit: this.heapSizeLimit,
isCloud: cloud?.isCloudEnabled ?? false,
isServerless: !!serverless,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you may want to check this.initContext.env.packageInfo.buildFlavor === 'serverless' and save yourself from one additional plugin dependency 😇

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good tip! Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 61f587b

@ymao1 ymao1 added the ci:project-deploy-elasticsearch Create an Elasticsearch Serverless project label Jul 31, 2024
@ymao1 ymao1 removed the ci:project-deploy-elasticsearch Create an Elasticsearch Serverless project label Aug 1, 2024
@ymao1
Copy link
Contributor Author

ymao1 commented Aug 2, 2024

@elasticmachine merge upstream

@ymao1
Copy link
Contributor Author

ymao1 commented Aug 3, 2024

@pmuellr I pushed some changes to accommodate the updates from your recent PR f180148

@pmuellr
Copy link
Member

pmuellr commented Aug 5, 2024

@elasticmachine merge upstream

Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM! Looked at the code diff and the additional commits, ran some local tests using both claimers, etc

@pmuellr
Copy link
Member

pmuellr commented Aug 6, 2024

@elasticmachine merge upstream

Noticed we hadn't run this on serverless, so building some cloud/serverless images to make sure they basically run ...

@kibana-ci
Copy link
Collaborator

kibana-ci commented Aug 6, 2024

💛 Build succeeded, but was flaky

  • Buildkite Build
  • Commit: 85690ec
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-189626-85690ecfc479

Failed CI Steps

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
taskManager 62 63 +1

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
taskManager 5 7 +2
Unknown metric groups

API count

id before after diff
taskManager 105 107 +2

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code LGTM; ran this locally w/mget, seems to work fine. Ran in cloud ESS and serverless (default claimer), and everything looks fine.

@@ -64,6 +67,8 @@ const requestTimeoutsConfig = schema.object({
export const configSchema = schema.object(
{
allow_reading_invalid_state: schema.boolean({ defaultValue: true }),
/* The number of normal cost tasks that this Kibana instance will run simultaneously */
capacity: schema.maybe(schema.number({ min: MIN_CAPACITY, max: MAX_CAPACITY })),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to add this to kibana-docker file, right? And to the cloud allow-list?

{ minHeap: 0, maxHeap: 1, capacity: 10 },
{ minHeap: 1, maxHeap: 2, capacity: 15 },
{ minHeap: 2, maxHeap: 4, capacity: 25, backgroundTaskNodeOnly: false },
{ minHeap: 2, maxHeap: 4, capacity: 50, backgroundTaskNodeOnly: true },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the constraints (cloud), presumably we'll not see anything greater than 4GB. Today. :-). But am wondering for a 4GB Kibana, what is the metrics.process.memory.heap.size_limit? Could it end up being just over 4GB?

Wonder if we should set the maxHeap value for the final 2 to a bigger number, I suspect 16 would cover us for a long time ... or I guess Infinity would be better?

// Capacity config describes the number of normal-cost tasks that can be
// run simulatenously. Multiple by the cost of a normal cost to determine
// the maximum allowed cost
this.maxAllowedCost = getCapacityInCost(capacity);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is some code going to get complain-y if maxAllowedCost is updated late, so is 0 for a while?

@pmuellr
Copy link
Member

pmuellr commented Aug 7, 2024

I created a followup issue for the comments I made in this PR: #190095

@pmuellr pmuellr merged commit e46e54a into elastic:main Aug 7, 2024
38 checks passed
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Aug 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting ci:build-cloud-image ci:build-serverless-image Feature:Task Manager release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants