[ResponseOps][TaskManager] followups from resource based scheduling #192124

Merged · 15 commits · Sep 12, 2024 · Changes from 5 commits
4 changes: 4 additions & 0 deletions docs/settings/task-manager-settings.asciidoc
@@ -23,6 +23,7 @@ How often, in milliseconds, the task manager will look for more work. Defaults
How many requests can Task Manager buffer before it rejects new requests. Defaults to 1000.

`xpack.task_manager.max_workers`::
deprecated:[8.16.0]
The maximum number of tasks that this Kibana instance will run simultaneously. Defaults to 10.
Starting in 8.0, it will not be possible to set the value greater than 100.

@@ -48,6 +49,9 @@ Enables event loop delay monitoring, which will log a warning when a task causes
`xpack.task_manager.event_loop_delay.warn_threshold`::
Sets the amount of event loop delay during a task execution which will cause a warning to be logged. Defaults to 5000 milliseconds (5 seconds).

`xpack.task_manager.capacity`::
Sets the number of normal cost tasks that can be run at one time. The minimum value is 5 and the maximum is 50. Defaults to 10.
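For example, a kibana.yml snippet for this setting (the value 30 here is purely illustrative, within the allowed 5-50 range):

[source,yml]
--------------------------------------------------
xpack.task_manager.capacity: 30
--------------------------------------------------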
Contributor:
What about something like "controls the number of tasks that can be run at one time"? cc @lcawl

Wording can be confusing because the capacity means something different depending on what the claim strategy is. It is either the number of tasks that can be run at one time or the total cost of tasks that can be run at one time.
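
To illustrate the distinction, a minimal TypeScript sketch; the `maxAllowed` helper, the string-literal strategy names, and the `NORMAL_TASK_COST` constant are illustrative, not the actual Kibana code:

```ts
// Illustrative sketch only -- not the Kibana implementation.
// Assumes a normal task costs 2 units, which the test expectations
// later in this diff imply (capacity 20 -> 40 cost units).
const NORMAL_TASK_COST = 2;

type ClaimStrategy = 'default' | 'mget';

// Under the default claim strategy, capacity is the number of tasks
// that can run at one time; under the cost-based (mget) strategy, the
// same setting is interpreted as a budget of task-cost units.
function maxAllowed(capacity: number, strategy: ClaimStrategy): number {
  return strategy === 'mget' ? capacity * NORMAL_TASK_COST : capacity;
}

// maxAllowed(10, 'default') === 10 (tasks)
// maxAllowed(10, 'mget')    === 20 (cost units, i.e. 10 normal-cost tasks)
```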

Contributor Author:

Yeah that makes sense to me! Thanks

Contributor Author:

I updated the text in commit baf8040


[float]
[[task-manager-health-settings]]
==== Task Manager Health settings
2 changes: 1 addition & 1 deletion docs/user/alerting/alerting-troubleshooting.asciidoc
@@ -197,7 +197,7 @@ If cluster performance becomes degraded from excessive or expensive rules and {kib}

[source,txt]
--------------------------------------------------
-xpack.task_manager.max_workers: 1
+xpack.task_manager.capacity: 5
xpack.task_manager.poll_interval: 1h
--------------------------------------------------

@@ -85,7 +85,7 @@ By default, each additional {kib} instance will add an additional 10 tasks that

Other times, it might be preferable to increase the throughput of individual {kib} instances.

-Tweak the *Max Workers* via the <<task-manager-settings,`xpack.task_manager.max_workers`>> setting, which allows each {kib} to pull a higher number of tasks per interval. This could impact the performance of each {kib} instance as the workload will be higher.
+Tweak the *Capacity* via the <<task-manager-settings,`xpack.task_manager.capacity`>> setting, which allows each {kib} to pull a higher number of tasks per interval. This could impact the performance of each {kib} instance as the workload will be higher.

Tweak the *Poll Interval* via the <<task-manager-settings,`xpack.task_manager.poll_interval`>> setting, which allows each {kib} to pull scheduled tasks at a higher rate. This could impact the performance of the {es} cluster as the workload will be higher.
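
For example, both levers could be set together in kibana.yml (the values are illustrative; a higher capacity increases the load on each {kib} instance, and a shorter poll interval increases the load on the {es} cluster):

[source,txt]
--------------------------------------------------
xpack.task_manager.capacity: 20
xpack.task_manager.poll_interval: 1000
--------------------------------------------------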

@@ -1002,7 +1002,7 @@ server log [12:41:33.672] [warn][plugins][taskManager][taskManager] taskManager

This log message tells us that Task Manager is not managing to keep up with the sheer amount of work it has been tasked with completing. This might mean that rules are not running at the frequency that was expected (for example, running every 7-8 minutes instead of every 5).

-By default Task Manager is limited to 10 tasks and this can be bumped up by setting a higher number in the kibana.yml file using the `xpack.task_manager.max_workers` configuration. It is important to keep in mind that a higher number of tasks running at any given time means more load on both Kibana and Elasticsearch, so only change this setting if increasing load in your environment makes sense.
+By default Task Manager is limited to 10 tasks and this can be bumped up by setting a higher number in the kibana.yml file using the `xpack.task_manager.capacity` configuration. It is important to keep in mind that a higher number of tasks running at any given time means more load on both Kibana and Elasticsearch, so only change this setting if increasing load in your environment makes sense.

Another approach to addressing this might be to tell Task Manager to check for more work at a higher rate, rather than increasing how much it can run at once, which is configured using `xpack.task_manager.poll_interval`. This value dictates how often Task Manager checks to see if there's more work to be done and uses milliseconds (by default it is 3000, which means an interval of 3 seconds).
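
For example (an illustrative kibana.yml snippet; 2000 means Task Manager checks for more work every 2 seconds instead of the default 3):

[source,txt]
--------------------------------------------------
xpack.task_manager.poll_interval: 2000
--------------------------------------------------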

@@ -238,6 +238,7 @@ kibana_vars=(
xpack.alerting.rules.run.actions.max
xpack.alerting.rules.run.alerts.max
xpack.alerting.rules.run.actions.connectorTypeOverrides
xpack.alerting.maxScheduledPerMinute
xpack.alerts.healthCheck.interval
xpack.alerts.invalidateApiKeysTask.interval
xpack.alerts.invalidateApiKeysTask.removalDelay
@@ -431,6 +432,8 @@ kibana_vars=(
xpack.task_manager.event_loop_delay.monitor
xpack.task_manager.event_loop_delay.warn_threshold
xpack.task_manager.worker_utilization_running_average_window
xpack.discovery.active_nodes_lookback
xpack.discovery.interval
xpack.uptime.index
serverless
)
@@ -19,8 +19,8 @@ interface GetDefaultCapacityOpts {
const HEAP_TO_CAPACITY_MAP = [
{ minHeap: 0, maxHeap: 1, capacity: 10 },
{ minHeap: 1, maxHeap: 2, capacity: 15 },
-{ minHeap: 2, maxHeap: 4, capacity: 25, backgroundTaskNodeOnly: false },
-{ minHeap: 2, maxHeap: 4, capacity: 50, backgroundTaskNodeOnly: true },
+{ minHeap: 2, maxHeap: 16, capacity: 25, backgroundTaskNodeOnly: false },
+{ minHeap: 2, maxHeap: 16, capacity: 50, backgroundTaskNodeOnly: true },
];

export function getDefaultCapacity({
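The body of `getDefaultCapacity` is collapsed in this diff. A sketch of how a lookup over `HEAP_TO_CAPACITY_MAP` could work, under the assumption that the first matching heap range wins and the smallest capacity is the fallback (assumed logic, not necessarily the merged implementation):

```ts
// Sketch only: the merged function body is collapsed above.
interface HeapCapacityRow {
  minHeap: number;
  maxHeap: number;
  capacity: number;
  backgroundTaskNodeOnly?: boolean;
}

// Pick the first row whose heap range contains the observed max heap
// size (in GB) and whose node-role flag, when present, matches; fall
// back to the smallest default capacity.
function getDefaultCapacitySketch(
  rows: HeapCapacityRow[],
  heapGb: number,
  isBackgroundTaskNodeOnly: boolean
): number {
  const match = rows.find(
    (row) =>
      heapGb > row.minHeap &&
      heapGb <= row.maxHeap &&
      (row.backgroundTaskNodeOnly === undefined ||
        row.backgroundTaskNodeOnly === isBackgroundTaskNodeOnly)
  );
  return match?.capacity ?? 10;
}

// With the updated map, a 16 GB heap on a dedicated background-task
// node would resolve to a capacity of 50; on a regular node, 25.
```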
@@ -22,7 +22,7 @@ describe('CostCapacity', () => {
const capacity$ = new Subject<number>();
const pool = new CostCapacity({ capacity$, logger });

-expect(pool.capacity).toBe(0);
+expect(pool.capacity).toBe(10);

capacity$.next(20);
expect(pool.capacity).toBe(40);
@@ -6,13 +6,14 @@
*/

import { Logger } from '@kbn/core/server';
import { DEFAULT_CAPACITY } from '../config';
import { TaskDefinition } from '../task';
import { TaskRunner } from '../task_running';
import { CapacityOpts, ICapacity } from './types';
import { getCapacityInCost } from './utils';

export class CostCapacity implements ICapacity {
-private maxAllowedCost: number = 0;
+private maxAllowedCost: number = DEFAULT_CAPACITY;
private logger: Logger;

constructor(opts: CapacityOpts) {
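The rest of the constructor is collapsed in this diff. Based on the test above (`pool.capacity` starts at 10, then `capacity$.next(20)` yields 40), a hedged sketch of the presumed behavior; the rxjs subscription and the cost conversion here are assumptions, not the merged code:

```ts
// Sketch of the presumed behavior, not the actual CostCapacity class.
import { Subject } from 'rxjs';

const DEFAULT_CAPACITY = 10; // assumed value, matching the test default
const NORMAL_TASK_COST = 2; // assumed unit cost of a "normal" task

class CostCapacitySketch {
  // starts at DEFAULT_CAPACITY instead of 0, so the pool reports a
  // sane capacity before the capacity$ observable emits its first value
  private maxAllowedCost: number = DEFAULT_CAPACITY;

  constructor(capacity$: Subject<number>) {
    capacity$.subscribe((capacity) => {
      // interpret the configured capacity as a count of normal-cost
      // tasks and convert it into a total cost budget
      this.maxAllowedCost = capacity * NORMAL_TASK_COST;
    });
  }

  public get capacity(): number {
    return this.maxAllowedCost;
  }
}

// new CostCapacitySketch(capacity$).capacity === 10 until capacity$
// emits; after capacity$.next(20), capacity === 40 -- matching the test.
```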
@@ -517,11 +517,11 @@ describe('TaskPool', () => {
expect(pool.availableCapacity()).toEqual(14);
});

-test('availableCapacity is 0 until capacity$ pushes a value', async () => {
+test('availableCapacity is 10 until capacity$ pushes a value', async () => {
const capacity$ = new Subject<number>();
const pool = new TaskPool({ capacity$, definitions, logger, strategy: CLAIM_STRATEGY_MGET });

-expect(pool.availableCapacity()).toEqual(0);
+expect(pool.availableCapacity()).toEqual(10);
capacity$.next(20);
expect(pool.availableCapacity()).toEqual(40);
});