[Response Ops][Task Manager] Expose SLI metrics in HTTP API #162178

Merged on Aug 10, 2023 (33 commits)
Commits
2fd6d11
Adding counter for task polling success rate
ymao1 Jul 18, 2023
d92ee65
Fixing types
ymao1 Jul 18, 2023
dd63d3f
Resetting counter on interval and on route access
ymao1 Jul 18, 2023
6497c0a
Merge branch 'main' into alerting/slis
kibanamachine Jul 19, 2023
07b13da
Adding task run metric
ymao1 Jul 20, 2023
808bf7f
Merge branch 'main' of github.com:elastic/kibana into alerting/slis
ymao1 Jul 20, 2023
ce03851
Merge branch 'alerting/slis' of github.com:ymao1/kibana into alerting…
ymao1 Jul 20, 2023
7a7606f
Grouping alerting and action task type run results
ymao1 Jul 20, 2023
e71e796
Merge branch 'main' of github.com:elastic/kibana into alerting/slis
ymao1 Jul 24, 2023
3c515cf
Hack to force an error
ymao1 Jul 24, 2023
0216353
Emitting correct event for alerting task failure
ymao1 Jul 24, 2023
99f89d8
Default query param to true
ymao1 Jul 24, 2023
a7ab380
Merge branch 'main' of github.com:elastic/kibana into alerting/slis
ymao1 Jul 27, 2023
72c70a4
Merge branch 'main' of github.com:elastic/kibana into alerting/slis
ymao1 Jul 28, 2023
5e598b6
Refactoring metrics streams
ymao1 Jul 31, 2023
8fa09ad
Merge branch 'main' of github.com:elastic/kibana into alerting/slis
ymao1 Jul 31, 2023
d3fc813
Cleanup
ymao1 Jul 31, 2023
81b899e
Adding hdr histogram for claim duration
ymao1 Jul 31, 2023
0bbcf65
Merge branch 'main' of github.com:elastic/kibana into alerting/slis
ymao1 Jul 31, 2023
0d733ff
Merge branch 'main' of github.com:elastic/kibana into alerting/slis
ymao1 Aug 1, 2023
b830920
Bucketing durations
ymao1 Aug 1, 2023
8a5c4f2
Merge branch 'main' of github.com:elastic/kibana into alerting/slis
ymao1 Aug 1, 2023
a3cce3b
Merge branch 'main' of github.com:elastic/kibana into alerting/slis
ymao1 Aug 4, 2023
6a90edc
adding functional test
ymao1 Aug 7, 2023
7eed409
Merge branch 'main' of github.com:elastic/kibana into alerting/slis
ymao1 Aug 7, 2023
2dcadca
tests
ymao1 Aug 7, 2023
27dfb86
[CI] Auto-commit changed files from 'node scripts/precommit_hook.js -…
kibanamachine Aug 7, 2023
2353945
cleanup
ymao1 Aug 7, 2023
48679d3
Merge branch 'alerting/slis' of github.com:ymao1/kibana into alerting…
ymao1 Aug 7, 2023
a7b25fc
fixing test
ymao1 Aug 7, 2023
29cf9c4
[CI] Auto-commit changed files from 'node scripts/precommit_hook.js -…
kibanamachine Aug 7, 2023
3da34a7
Merge branch 'main' into alerting/slis
kibanamachine Aug 8, 2023
d8192b8
Merge branch 'main' into alerting/slis
kibanamachine Aug 10, 2023
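
The commit messages above mention a counter for task polling success rate that is reset both on an interval and on route access, with a query parameter that defaults to true. A minimal sketch of that idea follows. The class, function, and parameter names are hypothetical and are not taken from the PR; the actual implementation in Task Manager may differ.

```typescript
// Hypothetical sketch: none of these names come from the PR itself.
class SuccessRateCounter {
  private success = 0;
  private total = 0;

  // Record the outcome of one polling cycle.
  record(succeeded: boolean): void {
    this.total += 1;
    if (succeeded) {
      this.success += 1;
    }
  }

  // Return a snapshot of the current counts without modifying them.
  collect(): { success: number; total: number } {
    return { success: this.success, total: this.total };
  }

  // Set both counts back to zero.
  reset(): void {
    this.success = 0;
    this.total = 0;
  }
}

// "On route access": read the metrics and, unless told otherwise, reset them.
// The `reset` flag defaulting to true is an assumption based on the
// "Default query param to true" commit message.
function readMetrics(counter: SuccessRateCounter, reset: boolean = true) {
  const snapshot = counter.collect();
  if (reset) {
    counter.reset();
  }
  return snapshot;
}

// "On interval": reset periodically so the counts do not grow without bound
// when nothing reads the route. Returns a function that stops the timer.
function startResetTimer(counter: SuccessRateCounter, intervalMs: number): () => void {
  const handle = setInterval(() => counter.reset(), intervalMs);
  return () => clearInterval(handle);
}
```

Resetting on read lets a scraper treat each response as the delta since its previous request, while the interval reset acts as a fallback when no scraper is running.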
3 changes: 3 additions & 0 deletions x-pack/plugins/task_manager/server/config.test.ts
Expand Up @@ -23,6 +23,7 @@ describe('config validation', () => {
},
"max_attempts": 3,
"max_workers": 10,
"metrics_reset_interval": 30000,
"monitored_aggregated_stats_refresh_rate": 60000,
"monitored_stats_health_verbose_log": Object {
"enabled": false,
Expand Down Expand Up @@ -81,6 +82,7 @@ describe('config validation', () => {
},
"max_attempts": 3,
"max_workers": 10,
"metrics_reset_interval": 30000,
"monitored_aggregated_stats_refresh_rate": 60000,
"monitored_stats_health_verbose_log": Object {
"enabled": false,
Expand Down Expand Up @@ -137,6 +139,7 @@ describe('config validation', () => {
},
"max_attempts": 3,
"max_workers": 10,
"metrics_reset_interval": 30000,
"monitored_aggregated_stats_refresh_rate": 60000,
"monitored_stats_health_verbose_log": Object {
"enabled": false,
107 changes: 57 additions & 50 deletions x-pack/plugins/task_manager/server/config.ts
Expand Up @@ -20,6 +20,8 @@ export const DEFAULT_MONITORING_REFRESH_RATE = 60 * 1000;
export const DEFAULT_MONITORING_STATS_RUNNING_AVERAGE_WINDOW = 50;
export const DEFAULT_MONITORING_STATS_WARN_DELAYED_TASK_START_IN_SECONDS = 60;

export const DEFAULT_METRICS_RESET_INTERVAL = 30 * 1000; // 30 seconds

// At the default poll interval of 3sec, this averages over the last 15sec.
export const DEFAULT_WORKER_UTILIZATION_RUNNING_AVERAGE_WINDOW = 5;

Expand Down Expand Up @@ -52,53 +54,63 @@ const eventLoopDelaySchema = schema.object({
});

const requeueInvalidTasksConfig = schema.object({
enabled: schema.boolean({ defaultValue: false }),
delay: schema.number({ defaultValue: 3000, min: 0 }),
enabled: schema.boolean({ defaultValue: false }),
max_attempts: schema.number({ defaultValue: 100, min: 1, max: 500 }),
});

export const configSchema = schema.object(
{
allow_reading_invalid_state: schema.boolean({ defaultValue: true }),
Review comment (contributor, PR author): These changes just alphabetize the config keys.

ephemeral_tasks: schema.object({
enabled: schema.boolean({ defaultValue: false }),
/* How many requests can Task Manager buffer before it rejects new requests. */
request_capacity: schema.number({
// a nice round contrived number, feel free to change as we learn how it behaves
defaultValue: 10,
min: 1,
max: DEFAULT_MAX_EPHEMERAL_REQUEST_CAPACITY,
}),
}),
event_loop_delay: eventLoopDelaySchema,
/* The maximum number of times a task will be attempted before being abandoned as failed */
max_attempts: schema.number({
defaultValue: 3,
min: 1,
}),
/* How often, in milliseconds, the task manager will look for more work. */
poll_interval: schema.number({
defaultValue: DEFAULT_POLL_INTERVAL,
min: 100,
}),
/* How many requests can Task Manager buffer before it rejects new requests. */
request_capacity: schema.number({
// a nice round contrived number, feel free to change as we learn how it behaves
defaultValue: 1000,
min: 1,
}),
/* The maximum number of tasks that this Kibana instance will run simultaneously. */
max_workers: schema.number({
defaultValue: DEFAULT_MAX_WORKERS,
// disable the task manager rather than trying to specify it with 0 workers
min: 1,
}),
/* The threshold percentage for workers experiencing version conflicts for shifting the polling interval. */
version_conflict_threshold: schema.number({
defaultValue: DEFAULT_VERSION_CONFLICT_THRESHOLD,
min: 50,
max: 100,
}),
/* The rate at which we emit fresh monitored stats. By default we'll use the poll_interval (+ a slight buffer) */
monitored_stats_required_freshness: schema.number({
defaultValue: (config?: unknown) =>
((config as { poll_interval: number })?.poll_interval ?? DEFAULT_POLL_INTERVAL) + 1000,
min: 100,
/* The interval at which monotonically increasing metrics counters will reset */
metrics_reset_interval: schema.number({
defaultValue: DEFAULT_METRICS_RESET_INTERVAL,
min: 10 * 1000, // minimum 10 seconds
}),
/* The rate at which we refresh monitored stats that require aggregation queries against ES. */
monitored_aggregated_stats_refresh_rate: schema.number({
defaultValue: DEFAULT_MONITORING_REFRESH_RATE,
/* don't run monitored stat aggregations any faster than once every 5 seconds */
min: 5000,
}),
monitored_stats_health_verbose_log: schema.object({
enabled: schema.boolean({ defaultValue: false }),
level: schema.oneOf([schema.literal('debug'), schema.literal('info')], {
defaultValue: 'debug',
}),
/* The amount of seconds we allow a task to delay before printing a warning server log */
warn_delayed_task_start_in_seconds: schema.number({
defaultValue: DEFAULT_MONITORING_STATS_WARN_DELAYED_TASK_START_IN_SECONDS,
}),
}),
/* The rate at which we emit fresh monitored stats. By default we'll use the poll_interval (+ a slight buffer) */
monitored_stats_required_freshness: schema.number({
defaultValue: (config?: unknown) =>
((config as { poll_interval: number })?.poll_interval ?? DEFAULT_POLL_INTERVAL) + 1000,
min: 100,
}),
/* The size of the running average window for monitored stats. */
monitored_stats_running_average_window: schema.number({
defaultValue: DEFAULT_MONITORING_STATS_RUNNING_AVERAGE_WINDOW,
Expand All @@ -107,44 +119,39 @@ export const configSchema = schema.object(
}),
/* Task Execution result warn & error thresholds. */
monitored_task_execution_thresholds: schema.object({
default: taskExecutionFailureThresholdSchema,
custom: schema.recordOf(schema.string(), taskExecutionFailureThresholdSchema, {
defaultValue: {},
}),
default: taskExecutionFailureThresholdSchema,
}),
monitored_stats_health_verbose_log: schema.object({
enabled: schema.boolean({ defaultValue: false }),
level: schema.oneOf([schema.literal('debug'), schema.literal('info')], {
defaultValue: 'debug',
}),
/* The amount of seconds we allow a task to delay before printing a warning server log */
warn_delayed_task_start_in_seconds: schema.number({
defaultValue: DEFAULT_MONITORING_STATS_WARN_DELAYED_TASK_START_IN_SECONDS,
}),
}),
ephemeral_tasks: schema.object({
enabled: schema.boolean({ defaultValue: false }),
/* How many requests can Task Manager buffer before it rejects new requests. */
request_capacity: schema.number({
// a nice round contrived number, feel free to change as we learn how it behaves
defaultValue: 10,
min: 1,
max: DEFAULT_MAX_EPHEMERAL_REQUEST_CAPACITY,
}),
/* How often, in milliseconds, the task manager will look for more work. */
poll_interval: schema.number({
defaultValue: DEFAULT_POLL_INTERVAL,
min: 100,
}),
event_loop_delay: eventLoopDelaySchema,
worker_utilization_running_average_window: schema.number({
defaultValue: DEFAULT_WORKER_UTILIZATION_RUNNING_AVERAGE_WINDOW,
max: 100,
/* How many requests can Task Manager buffer before it rejects new requests. */
request_capacity: schema.number({
// a nice round contrived number, feel free to change as we learn how it behaves
defaultValue: 1000,
min: 1,
}),
requeue_invalid_tasks: requeueInvalidTasksConfig,
/* These are not designed to be used by most users. Please use caution when changing these */
unsafe: schema.object({
exclude_task_types: schema.arrayOf(schema.string(), { defaultValue: [] }),
authenticate_background_task_utilization: schema.boolean({ defaultValue: true }),
exclude_task_types: schema.arrayOf(schema.string(), { defaultValue: [] }),
}),
/* The threshold percentage for workers experiencing version conflicts for shifting the polling interval. */
version_conflict_threshold: schema.number({
defaultValue: DEFAULT_VERSION_CONFLICT_THRESHOLD,
min: 50,
max: 100,
}),
worker_utilization_running_average_window: schema.number({
defaultValue: DEFAULT_WORKER_UTILIZATION_RUNNING_AVERAGE_WINDOW,
max: 100,
min: 1,
}),
requeue_invalid_tasks: requeueInvalidTasksConfig,
allow_reading_invalid_state: schema.boolean({ defaultValue: true }),
},
{
validate: (config) => {
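
The config.ts diff above adds `metrics_reset_interval` with a default of `DEFAULT_METRICS_RESET_INTERVAL` (30 seconds) and a minimum of 10 seconds. The sketch below shows the same default-and-minimum rule in plain TypeScript, without the `schema` helper used in the diff. `resolveMetricsResetInterval` and `MIN_METRICS_RESET_INTERVAL` are hypothetical names introduced only for illustration.

```typescript
// Values copied from the diff above.
const DEFAULT_METRICS_RESET_INTERVAL = 30 * 1000; // 30 seconds
// Hypothetical constant name; the diff inlines this value as `min: 10 * 1000`.
const MIN_METRICS_RESET_INTERVAL = 10 * 1000; // 10 seconds

// Hypothetical helper: apply the default when no value is configured and
// reject values below the minimum, mirroring what the schema entry expresses.
function resolveMetricsResetInterval(value?: number): number {
  if (value === undefined) {
    return DEFAULT_METRICS_RESET_INTERVAL;
  }
  if (!Number.isFinite(value) || value < MIN_METRICS_RESET_INTERVAL) {
    throw new Error(
      `metrics_reset_interval must be a number >= ${MIN_METRICS_RESET_INTERVAL}, got ${value}`
    );
  }
  return value;
}
```

Note that the test fixtures later in this diff set `metrics_reset_interval: 3000`, which is below this minimum. They appear to pass a config object directly rather than going through schema validation, so the minimum would not be enforced there.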
Expand Up @@ -84,6 +84,7 @@ describe('EphemeralTaskLifecycle', () => {
delay: 3000,
max_attempts: 20,
},
metrics_reset_interval: 3000,
...config,
},
elasticsearchAndSOAvailability$,
Expand Up @@ -79,6 +79,7 @@ describe('managed configuration', () => {
delay: 3000,
max_attempts: 20,
},
metrics_reset_interval: 3000,
});
logger = context.logger.get('taskManager');
