Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Monitoring] Fetch shard data more efficiently #54028

Merged

Conversation

chrisronline
Copy link
Contributor

@chrisronline chrisronline commented Jan 6, 2020

While debugging some performance issues on an ESMS cluster with @pickypg, we discovered that our query to fetch shard data (in an oversharded environment) performed very poorly. It turns out that there are a few major issues with our existing query:

  1. No matter the provided parameters, we always were fetching shard data for indices (which unnecessarily slowed down the node listing page)
  2. Most of the time, we only care about unassigned shards, but the query looks at all shard states.
  3. There is no way to filter the query down to only a specific node or index, which means that on each specific node/index page, we are fetching all shard data.

This PR fixes all of these issues and drastically improves the loading time of various ES monitoring pages that slow down for large clusters.

Performance

On a sample ESMS cluster (which is severely oversharded) in a constant, absolute time period, I tested the timing to fetch shard stats data.

Current

Indices listing: ~23s
Nodes listing: ~23s
ML jobs listing: ~23s
ES cluster overview: ~23s
Index detail page: ~23s
Node detail page: ~23s

PR

Indices listing: ~1.7s
Nodes listing: ~1.7s
ML jobs listing: ~1.7s
ES cluster overview: ~1.7s
Index detail page: ~215ms
Node detail page: ~1.2s

Testing

This is a bit tricky. The UI should be unaffected - the api should return the same data the UI needs so we're just looking to ensure we didn't miss something.

Notes

  • Some of these test fixes are a result of a bad change here. I think the test were changed to a point where they weren't even working with real data.

@chrisronline chrisronline self-assigned this Jan 6, 2020
@kibanamachine
Copy link
Contributor

💔 Build Failed

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@chrisronline chrisronline marked this pull request as ready for review January 7, 2020 16:29
@chrisronline chrisronline requested a review from a team January 7, 2020 16:29
@elasticmachine
Copy link
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

}
],
"shardStats": {
"nodes": {
"jUT5KdxfRbORSCWkb5zjmA": {
"shardCount": 38,
"indexCount": 20,
"shardCount": 5,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure these are different because the new query now limits the shard data to the specific index instead of across all indices

@@ -32,7 +32,7 @@ export default function({ getService }) {
it('should summarize node with metrics', async () => {
const { body } = await supertest
.post(
'/api/monitoring/v1/clusters/YCxj-RAgSZCP6GuOQ8M1EQ/elasticsearch/nodes/jxcP6ue7eRCieNNitFTT0EA'
'/api/monitoring/v1/clusters/YCxj-RAgSZCP6GuOQ8M1EQ/elasticsearch/nodes/jUT5KdxfRbORSCWkb5zjmA'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea why this was changed. The original node id doesn't actually exist in the archived data! See #23715

@chrisronline chrisronline requested review from igoristic and removed request for a team January 7, 2020 20:20
const esIndexPattern = '*';
const cluster = {};
const stats = await getIndicesUnassignedShardStats(req, esIndexPattern, cluster);
expect(stats.indices).toEqual(indices);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to also test status here? Since, looks like you already have the right replica/primary counts to test for all three colors 💚 💛 ❤️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@igoristic igoristic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome stuff @chrisronline! 🏆

My benchmarks were a little faster overall, but were still within similar margins. Maybe because I ran it from docker (or my computer is > than yours)

@chrisronline
Copy link
Contributor Author

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@chrisronline chrisronline merged commit bf7c253 into elastic:master Jan 10, 2020
@chrisronline chrisronline deleted the monitoring/shard_data_listing_pages branch January 10, 2020 19:06
chrisronline added a commit to chrisronline/kibana that referenced this pull request Jan 10, 2020
* For the nodes listing page, do not fetch shard data for indices

* Optimize our shard queries for the index and node listing pages

* This change isn't necessary

* Rename file and function

* Use optimized query for ml jobs and es overview

* Apply to node/index detail page, and more renaming

* Unnecessary change

* Fix tests

* Add basic tests

Co-authored-by: Elastic Machine <[email protected]>
chrisronline added a commit that referenced this pull request Jan 10, 2020
* For the nodes listing page, do not fetch shard data for indices

* Optimize our shard queries for the index and node listing pages

* This change isn't necessary

* Rename file and function

* Use optimized query for ml jobs and es overview

* Apply to node/index detail page, and more renaming

* Unnecessary change

* Fix tests

* Add basic tests

Co-authored-by: Elastic Machine <[email protected]>

Co-authored-by: Elastic Machine <[email protected]>
@chrisronline
Copy link
Contributor Author

Backport:

7.x: d811e4d

gmmorris added a commit to gmmorris/kibana that referenced this pull request Jan 13, 2020
* master: (69 commits)
  [Graph] Fix various a11y issues (elastic#54097)
  Add ApplicationService app status management (elastic#50223)
  logs in one time (elastic#54447)
  Deprecate using `elasticsearch.ssl.certificate` without `elasticsearch.ssl.key` and vice versa (elastic#54392)
  [Optimizer] Fix a stack overflow with watch_cache when it attempts to delete very large folders. (elastic#54457)
  Security - Role Mappings UI (elastic#53620)
  [SIEM] [Detection engine] Permission II (elastic#54292)
  Allow User to Cleanup Repository from UI  (elastic#53047)
  [Detection engine] Some UX for rule creation (elastic#54471)
  share specific instances of some ui packages (elastic#54079)
  [ML] APM modules configs for RUM Javascript and NodeJS (elastic#53792)
  [APM] Delay rendering invalid license notification (elastic#53924)
  [Graph] Improve error message on graph requests (elastic#54230)
  [ILM] Kibana should allow a min_age setting of 0ms in ILM policy phases (elastic#53719)
  Unit Tests for common/lib (elastic#53736)
  [Graph] Only show explorable fields (elastic#54101)
  remove linting rule exception for markdown (elastic#54232)
  [Monitoring] Fetch shard data more efficiently (elastic#54028)
  [Maps] Add hiddenLayers option to embeddable map input (elastic#54355)
  Pass termOrder and hasTermsAgg properties to serializeThresholdWatch function (elastic#54391)
  ...
gmmorris added a commit to gmmorris/kibana that referenced this pull request Jan 13, 2020
* master: (69 commits)
  [Graph] Fix various a11y issues (elastic#54097)
  Add ApplicationService app status management (elastic#50223)
  logs in one time (elastic#54447)
  Deprecate using `elasticsearch.ssl.certificate` without `elasticsearch.ssl.key` and vice versa (elastic#54392)
  [Optimizer] Fix a stack overflow with watch_cache when it attempts to delete very large folders. (elastic#54457)
  Security - Role Mappings UI (elastic#53620)
  [SIEM] [Detection engine] Permission II (elastic#54292)
  Allow User to Cleanup Repository from UI  (elastic#53047)
  [Detection engine] Some UX for rule creation (elastic#54471)
  share specific instances of some ui packages (elastic#54079)
  [ML] APM modules configs for RUM Javascript and NodeJS (elastic#53792)
  [APM] Delay rendering invalid license notification (elastic#53924)
  [Graph] Improve error message on graph requests (elastic#54230)
  [ILM] Kibana should allow a min_age setting of 0ms in ILM policy phases (elastic#53719)
  Unit Tests for common/lib (elastic#53736)
  [Graph] Only show explorable fields (elastic#54101)
  remove linting rule exception for markdown (elastic#54232)
  [Monitoring] Fetch shard data more efficiently (elastic#54028)
  [Maps] Add hiddenLayers option to embeddable map input (elastic#54355)
  Pass termOrder and hasTermsAgg properties to serializeThresholdWatch function (elastic#54391)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants