[Response Ops][Task Manager] Resource based task scheduling - 2nd attempt #189626
Conversation
Pinging @elastic/response-ops (Team:ResponseOps)
kibana.jsonc
LGTM
```ts
claimStrategy: this.config?.claim_strategy,
heapSizeLimit: this.heapSizeLimit,
isCloud: cloud?.isCloudEnabled ?? false,
isServerless: !!serverless,
```
nit: you may want to check `this.initContext.env.packageInfo.buildFlavor === 'serverless'` and save yourself from one additional plugin dependency 😇
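A minimal sketch of the suggested check, assuming `this.initContext` is the `PluginInitializerContext` the plugin receives in its constructor (the surrounding assignment is illustrative, not the PR's actual code):

```ts
// Derive the flavor from build info instead of depending on the
// serverless plugin.
const isServerless =
  this.initContext.env.packageInfo.buildFlavor === 'serverless';

// Illustrative: thread the flag through wherever the claim strategy
// and capacity are derived.
const capacityContext = {
  claimStrategy: this.config?.claim_strategy,
  heapSizeLimit: this.heapSizeLimit,
  isCloud: cloud?.isCloudEnabled ?? false,
  isServerless,
};
```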
Oh good tip! Thanks!
Updated in 61f587b
@elasticmachine merge upstream
@elasticmachine merge upstream
Changes LGTM! Looked at the code diff and the additional commits, ran some local tests using both claimers, etc.
@elasticmachine merge upstream
Noticed we hadn't run this on serverless, so building some cloud/serverless images to make sure they basically run ...
💛 Build succeeded, but was flaky
Failed CI Steps
Metrics [docs]
- Public APIs missing comments
- Public APIs missing exports
To update your PR or re-run it, just comment with:
cc @ymao1
code LGTM; ran this locally w/mget, seems to work fine. Ran in cloud ESS and serverless (default claimer), and everything looks fine.
```diff
@@ -64,6 +67,8 @@ const requestTimeoutsConfig = schema.object({
 export const configSchema = schema.object(
   {
     allow_reading_invalid_state: schema.boolean({ defaultValue: true }),
+    /* The number of normal cost tasks that this Kibana instance will run simultaneously */
+    capacity: schema.maybe(schema.number({ min: MIN_CAPACITY, max: MAX_CAPACITY })),
```
We want to add this to the kibana-docker file, right? And to the cloud allow-list?
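For reference, a hedged example of how the new setting would look in kibana.yml (the value is illustrative and must fall within `MIN_CAPACITY`/`MAX_CAPACITY`):

```yaml
# kibana.yml — run up to 20 normal-cost tasks concurrently on this instance
xpack.task_manager.capacity: 20
```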
```ts
{ minHeap: 0, maxHeap: 1, capacity: 10 },
{ minHeap: 1, maxHeap: 2, capacity: 15 },
{ minHeap: 2, maxHeap: 4, capacity: 25, backgroundTaskNodeOnly: false },
{ minHeap: 2, maxHeap: 4, capacity: 50, backgroundTaskNodeOnly: true },
```
Given the constraints (cloud), presumably we'll not see anything greater than 4GB. Today. :-) But am wondering, for a 4GB Kibana, what is the `metrics.process.memory.heap.size_limit`? Could it end up being just over 4GB? Wonder if we should set the `maxHeap` value for the final 2 to a bigger number; I suspect 16 would cover us for a long time ... or I guess `Infinity` would be better?
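A minimal sketch of how the top tiers could be made open-ended with `Infinity`, assuming the table is scanned for the first matching heap range (the interface and lookup function here are illustrative, not the PR's actual code):

```ts
interface CapacityTier {
  minHeap: number; // GB, inclusive lower bound
  maxHeap: number; // GB, exclusive upper bound
  capacity: number;
  backgroundTaskNodeOnly?: boolean;
}

const tiers: CapacityTier[] = [
  { minHeap: 0, maxHeap: 1, capacity: 10 },
  { minHeap: 1, maxHeap: 2, capacity: 15 },
  // Infinity keeps the top tiers open-ended, so a reported heap limit
  // slightly over 4GB still matches instead of falling through.
  { minHeap: 2, maxHeap: Infinity, capacity: 25, backgroundTaskNodeOnly: false },
  { minHeap: 2, maxHeap: Infinity, capacity: 50, backgroundTaskNodeOnly: true },
];

function capacityForHeap(heapGb: number, isBackgroundTaskNode: boolean): number | undefined {
  return tiers.find(
    (t) =>
      heapGb >= t.minHeap &&
      heapGb < t.maxHeap &&
      (t.backgroundTaskNodeOnly === undefined ||
        t.backgroundTaskNodeOnly === isBackgroundTaskNode)
  )?.capacity;
}
```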
```ts
// Capacity config describes the number of normal-cost tasks that can be
// run simultaneously. Multiply by the cost of a normal task to determine
// the maximum allowed cost.
this.maxAllowedCost = getCapacityInCost(capacity);
```
is some code going to get complain-y if `maxAllowedCost` is updated late, so is `0` for a while?
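For context, a plausible shape for that conversion, assuming a fixed cost constant for a normal task (the constant's value here is an assumption, not taken from the PR):

```ts
// Assumed cost of a single normal task; the real value lives in the
// task cost definitions.
const TASK_COST_NORMAL = 2;

// Capacity counts normal-cost tasks, so the maximum allowed cost is
// simply capacity expressed in cost units.
function getCapacityInCost(capacity: number): number {
  return capacity * TASK_COST_NORMAL;
}

// e.g. a capacity of 10 normal tasks allows a total running cost of 20,
// which could instead be spent on, say, 5 tasks of cost 4.
```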
I created a followup issue for the comments I made in this PR: #190095
Summary
Redoing the resource-based task claim PR #187999 and follow-up PRs #189220 and #189117. Please see the descriptions of those PRs for more details.
This was originally reverted because unregistered task types in serverless caused the task manager health aggregation to fail. This PR includes an additional commit to exclude unregistered task types from the health report: 58eb2b1.
To verify this, make sure you're using the `default` claim strategy and start up Kibana so that the default set of tasks gets created. Then either disable a bunch of plugins via config, or comment out the task registration of a task that was previously scheduled (I'm using the observability AI assistant), and restart Kibana. You should still be able to access the TM health report with the workload field, and if you update the background health logging so it always logs and more frequently, you should see the logging succeed with no errors.
Below, I've made changes to always log the background health at a 15-second interval.
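As a hedged approximation of those changes, similar behavior can be had via kibana.yml using task manager's health-monitoring settings (setting names are from the existing config schema; the values are illustrative):

```yaml
# kibana.yml — log the health report verbosely and refresh monitored
# stats more often than the default
xpack.task_manager.monitored_stats_health_verbose_log.enabled: true
xpack.task_manager.monitored_aggregated_stats_refresh_rate: 15000
```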