Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Task Manager] Batches the update operations in Task Manager #71470

Merged
merged 10 commits into from
Jul 21, 2020

Conversation

gmmorris
Copy link
Contributor

@gmmorris gmmorris commented Jul 13, 2020

Summary

This PR attempts to batch update tasks in Task Manager in order to avoid overloading the Elasticsearch queue.
This is the 1st PR addressing #65551

Under the hood we now use a Reactive buffer accumulates all calls to the update api in the TaskStore and flushes after 50ms or when as many operations as there are workers have been buffered (whichever comes first).

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@gmmorris gmmorris added Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.10.0 v8.0.0 labels Jul 13, 2020
gmmorris added 4 commits July 13, 2020 19:38
* master: (314 commits)
  [APM] Use status_code field to calculate error rate (elastic#71109)
  [Observability] Change appLink passing the date range (elastic#71259)
  [Security] Add Timeline improvements (elastic#71506)
  adjust vislib bar opacity (elastic#71421)
  Fix ScopedHistory mock and adapt usages (elastic#71404)
  [Security Solution] Add hook for reading/writing resolver query params (elastic#70809)
  [APM] Bug fixes from ML integration testing (elastic#71564)
  [Discover] Add caused_by.type and caused_by.reason to error toast modal (elastic#70404)
  [Security Solution] Add 3rd level breadcrumb to admin page (elastic#71275)
  [Security Solution][Exceptions] Exception modal bulk close alerts that match exception attributes (elastic#71321)
  Change signal.rule.risk score mapping from keyword to float (elastic#71126)
  Added help text where needed on connectors and alert actions UI (elastic#69601)
  [SIEM][Detections] Value Lists Management Modal (elastic#67068)
  [test] Skips test preventing promotion of ES snapshot elastic#71582
  [test] Skips test preventing promotion of ES snapshot elastic#71555
  [ILM] Fix alignment of the timing field (elastic#71273)
  [SIEM][Detection Engine][Lists] Adds the ability for exception lists to be multi-list queried. (elastic#71540)
  initial telemetry setup (elastic#69330)
  [Reporting] Formatting fixes for CSV export in Discover, CSV download from Dashboard panel (elastic#67027)
  Search across spaces (elastic#67644)
  ...
@gmmorris gmmorris marked this pull request as ready for review July 14, 2020 15:03
@gmmorris gmmorris requested a review from a team as a code owner July 14, 2020 15:03
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

* master: (214 commits)
  replacing hard coded links for ela.st (elastic#72240)
  skip flaky suite (elastic#60865)
  chore(NA): teardown dynamic dll plugin (elastic#72096)
  Register navLink actions for declared applications (elastic#72109)
  Fix value for process.hash.sha256 draggable (elastic#72142)
  Call setupIngest before fleet_install tests (elastic#72214)
  [Security Solution][Detections] Better toast errors (elastic#72205)
  skip flaky suite (elastic#64696)
  [Security Solution][Detections] Disable exceptions for Threshold and ML rules (elastic#72137)
  [Security Solution][Detections,Lists] Miscellaneous post-FF fixes (elastic#71990)
  [baseline/capture] use high-memory nodes with ramDisks (elastic#71894)
  skip flaky suite (elastic#77207)
  [Maps] Fix issue preventing TMS from rendering correctly (elastic#71946)
  using test_user with minimum privs (elastic#71988)
  Fixed Webhook connector doesn't retain added HTTP header settings (elastic#71924)
  [Ingest Manager] Do not show enrolling and unenrolling agents as online in agent counters (elastic#71921)
  [Maps] fix 'New Map' from getting added to recently accessed (elastic#72125)
  [Visualizations] Pass 'aggs' parameter to custom request handlers (elastic#71423)
  [Monitoring] Out of the box alert tweaks (elastic#71942)
  [ML] Fix datafeed start time is incorrect when the job has trailing empty buckets (elastic#71976)
  ...
@mikecote mikecote self-requested a review July 17, 2020 17:12
@pmuellr
Copy link
Member

pmuellr commented Jul 17, 2020

Just tried this PR out with my 100 index threshold alerts scheduling a server log for 4 instances every second (TM poll interval) w/20 workers. Earlier this week, it backed up the TM queue pretty quickly, 1000's of entries in minutes. After about 5 minutes of running, queue not growing at all! Queue size guestimated via cat/indices?v. \o/

CPU usage seems up a little bit (130% or so before, now more like 150%), but I'd say that's expected when you reduce other latencies like this.

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code LGTM and seems to work as expected (more throughput with lots of active alerts)

x-pack/plugins/task_manager/server/lib/result_type.ts Outdated Show resolved Hide resolved
@@ -47,6 +53,25 @@ export async function promiseResult<T, E>(future: Promise<T>): Promise<Result<T,
}
}

export async function unwrapPromise<T, E>(future: Promise<Result<T, E>>): Promise<T> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trying to wrap my head around what this is doing, thought I'd look at the tests, and there aren't any!

Would probably be useful to have tests for result_type.ts, not a blocker for this PR obviously ...

Copy link
Contributor Author

@gmmorris gmmorris Jul 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's fair- this used to be just typing, it now has some logic.
I'll add some tests.

In any case:
Basically unwrapPromise expects a Promise to resolve with a Result<T,E> and if it's an ok it resolves, if it's an err it rejects.
If the future rejects, we want to unwrap the error from it's Result container, and that's why we have the catch.
We can't put the catch after the then because then it'll catch the rejection we meant to return in the then.

It's annoying as hell and would be much clearer if Typescript just let us add typing to rejected Promises... the exceptions model in JS is... lacking... to say the least.

function isSavedObjectsUpdateResponse(
result: SavedObjectsUpdateResponse | Error
): result is SavedObjectsUpdateResponse {
return result && typeof (result as SavedObjectsUpdateResponse).id === 'string';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems slightly icky - the idea is that an error won't have an id property? Which seems pretty safe, but you'd think there would be a cleaner way to do this.

Copy link
Contributor Author

@gmmorris gmmorris Jul 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agreed, but sadly, this is the shape returned by SavedObjectsRepository :/
Can't think of a better way without changing how the SavedObjectsRepository works.

Any ideas? 🤷

gmmorris added 4 commits July 20, 2020 15:12
* master: (60 commits)
  [SIEM][Detection Engine][Lists] Adds list permissions (elastic#72335)
  [SIEM][Detection Engine][Lists] Adds conflict versioning and io-ts improvements to lists (elastic#72337)
  [Resolver] no longer pass related event stats to process node component (elastic#72435)
  Revert "skip flaky suite (elastic#72146)"
  [Security Solution] Cleanup endpoint telemetry (elastic#71950)
  Unskip dashboard embeddable rendering tests (elastic#71824)
  [ENDPOINT] Added unerolling status for host. (elastic#72303)
  [Alerting][Connectors] Increase the size of the logos (elastic#72419)
  [SECURITY] [Timeline] Raw events not displayed (elastic#72387)
  [ML] Fixes display of regression stop stats if one is NaN (elastic#72412)
  [Ingest Pipelines] Processor Editor Move Tooltip (elastic#72239)
  Fix match phrase and not match phrase comparators (elastic#71850)
  [Plugin Generator] Generate tsconfig and useDefaultBehaviors (elastic#72040)
  [Security Solution][Timeline] Fix timeline styling and createFrom beh… (elastic#72152)
  [Resolver] Selector performance (elastic#72380)
  [Ingest Manager] Set `_meta` in the index.mappings (elastic#72026)
  [Ingest Manager] Do not bumb config revision during config creation (elastic#72270)
  [ML] Adding missing index pattern name to new job wizards (elastic#72400)
  [ML] improve annotation flyout performance (elastic#72299)
  [APM] Testing error rate API and restructuring folders (elastic#72257)
  ...
@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Build metrics

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just one question

import { unwrapPromise } from './lib/result_type';

// by default allow updates to be buffered for up to 50ms
const DEFAULT_BUFFER_MAX_DURATION = 50;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think there would be value to have this as a kibana.yml config with a default of that number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered it, but it feel very internal and I think for now we can keep it as a constant.
We can expose it later if we feel it's worth it, but I think the poll_interval and max_workers is enough cognitive load for our users. :)

@gmmorris gmmorris merged commit 8fdebc9 into elastic:master Jul 21, 2020
gmmorris added a commit to gmmorris/kibana that referenced this pull request Jul 21, 2020
…#71470)

This PR attempts to batch update tasks in Task Manager in order to avoid overloading the Elasticsearch queue.
This is the 1st PR addressing elastic#65551

Under the hood we now use a Reactive buffer accumulates all calls to the `update` api in the TaskStore and flushes after 50ms or when as many operations as there are workers have been buffered (whichever comes first).
gmmorris added a commit that referenced this pull request Jul 21, 2020
…#72626)

This PR attempts to batch update tasks in Task Manager in order to avoid overloading the Elasticsearch queue.
This is the 1st PR addressing #65551

Under the hood we now use a Reactive buffer accumulates all calls to the `update` api in the TaskStore and flushes after 50ms or when as many operations as there are workers have been buffered (whichever comes first).
gmmorris added a commit to gmmorris/kibana that referenced this pull request Jul 21, 2020
* master:
  [Security Solution][Timeline] Add Empty view to the Timelines page (elastic#72576)
  [Security Solution][Resolver] Show process detail panel when clicking a process node (elastic#72563)
  Move manifest packageConfig mocks into security_solution plugin (elastic#72527)
  [QA][Code Coverage] Fixup Team Assignment (elastic#72467)
  [docs] remove references to tile map visualization in supported aggregations (elastic#72493)
  [ci][apm-ui] fix argument name for disabling pr comments (elastic#72633)
  Only check that the event ids are the same in arrays (elastic#72624)
  Add doc titles to ES UI apps (elastic#71045)
  Add Upgrade Assistant API integration test to ensure the reindex operation saved object can handle immense error messages (elastic#72347)
  [APM] Disable flaky rum e2e’s (elastic#72614)
  Applying tiny fix from 72532 to main branch (elastic#72533)
  [APM] Update script with new roles/users (elastic#72599)
  [Security Solution] Add margin (elastic#72542)
  Migrated fixed_scroll karma tests to jest (elastic#72258)
  [ML] Handling data recognizer saved object errors (elastic#72447)
  [Monitoring] Fix the messaging around needing TLS enabled (elastic#72310)
  [Task Manager] Batches the update operations in Task Manager  (elastic#71470)
@tylersmalley
Copy link
Contributor

Reverted the 7.x backport due to #72700
7.x/7.10: 1df44e5

gmmorris added a commit to gmmorris/kibana that referenced this pull request Jul 21, 2020
…#71470)

This PR attempts to batch update tasks in Task Manager in order to avoid overloading the Elasticsearch queue.
This is the 1st PR addressing elastic#65551

Under the hood we now use a Reactive buffer accumulates all calls to the `update` api in the TaskStore and flushes after 50ms or when as many operations as there are workers have been buffered (whichever comes first).
gmmorris added a commit that referenced this pull request Jul 22, 2020
…71470) (#72724)

* [Task Manager] Batches the update operations in Task Manager  (#71470)

This PR attempts to batch update tasks in Task Manager in order to avoid overloading the Elasticsearch queue.
This is the 1st PR addressing #65551

Under the hood we now use a Reactive buffer accumulates all calls to the `update` api in the TaskStore and flushes after 50ms or when as many operations as there are workers have been buffered (whichever comes first).

* removed next tick scheduling as we dont use it and its test was too flaky
gmmorris added a commit to gmmorris/kibana that referenced this pull request Jul 22, 2020
…feature-privileges

* alerting/consumer-based-rbac:
  [Security Solution][Timeline] Add Empty view to the Timelines page (elastic#72576)
  [Security Solution][Resolver] Show process detail panel when clicking a process node (elastic#72563)
  renamed variable to make it clear the SO client is unsecured
  Move manifest packageConfig mocks into security_solution plugin (elastic#72527)
  [QA][Code Coverage] Fixup Team Assignment (elastic#72467)
  [docs] remove references to tile map visualization in supported aggregations (elastic#72493)
  [ci][apm-ui] fix argument name for disabling pr comments (elastic#72633)
  Only check that the event ids are the same in arrays (elastic#72624)
  includes hidden params type in SO client
  Add doc titles to ES UI apps (elastic#71045)
  Add Upgrade Assistant API integration test to ensure the reindex operation saved object can handle immense error messages (elastic#72347)
  [APM] Disable flaky rum e2e’s (elastic#72614)
  Applying tiny fix from 72532 to main branch (elastic#72533)
  [APM] Update script with new roles/users (elastic#72599)
  [Security Solution] Add margin (elastic#72542)
  Migrated fixed_scroll karma tests to jest (elastic#72258)
  [ML] Handling data recognizer saved object errors (elastic#72447)
  [Monitoring] Fix the messaging around needing TLS enabled (elastic#72310)
  [Task Manager] Batches the update operations in Task Manager  (elastic#71470)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Alerting release_note:enhancement Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.10.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants