[Monitoring] Fetch shard data more efficiently #54028

chrisronline · 2020-01-06T17:23:14Z

While debugging some performance issues on an ESMS cluster with @pickypg, we discovered that our query to fetch shard data (in an oversharded environment) performed very poorly. It turns out that there are a few major issues with our existing query:

* No matter the provided parameters, we were always fetching shard data for indices (which unnecessarily slowed down the node listing page).
* Most of the time, we only care about unassigned shards, but the query looks at all shard states.
* There is no way to filter the query down to a specific node or index, which means that on each node/index detail page, we are fetching all shard data.
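
As a rough illustration of the three issues above (a sketch, not the code in this PR), a query builder could add each restriction only when the caller asks for it. The helper name `buildShardQuery`, its options, and the field names (`type`, `cluster_uuid`, `shard.state`, `shard.node`, `shard.index`) are assumptions made for the example.

```javascript
// Hypothetical helper: builds an Elasticsearch request body for shard documents.
// All field names below are assumptions for illustration, not taken from this PR.
function buildShardQuery({ clusterUuid, nodeUuid, indexName, onlyUnassigned = false }) {
  const filters = [
    { term: { type: 'shards' } },
    { term: { cluster_uuid: clusterUuid } },
  ];

  // Listing pages mostly care about unassigned shards, so skip every other state.
  if (onlyUnassigned) {
    filters.push({ term: { 'shard.state': 'UNASSIGNED' } });
  }

  // Detail pages narrow the query to a single node or a single index.
  // An unassigned shard has no node, so nodeUuid and onlyUnassigned
  // are not expected to be combined.
  if (nodeUuid) {
    filters.push({ term: { 'shard.node': nodeUuid } });
  }
  if (indexName) {
    filters.push({ term: { 'shard.index': indexName } });
  }

  // size: 0 because callers would aggregate rather than read individual hits.
  return { size: 0, query: { bool: { filter: filters } } };
}

// Node detail page: only shards that live on one node.
const nodeQuery = buildShardQuery({
  clusterUuid: 'YCxj-RAgSZCP6GuOQ8M1EQ',
  nodeUuid: 'jUT5KdxfRbORSCWkb5zjmA',
});

// Listing page: only unassigned shards across the cluster.
const listingQuery = buildShardQuery({
  clusterUuid: 'YCxj-RAgSZCP6GuOQ8M1EQ',
  onlyUnassigned: true,
});
```

Keeping every restriction in the `filter` clause means none of them contributes to scoring, which is all these pages need.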

This PR fixes all of these issues and drastically improves the loading time of various ES monitoring pages that slow down for large clusters.

Performance

On a sample ESMS cluster (which is severely oversharded) in a constant, absolute time period, I tested the timing to fetch shard stats data.

Current

Indices listing: ~23s
Nodes listing: ~23s
ML jobs listing: ~23s
ES cluster overview: ~23s
Index detail page: ~23s
Node detail page: ~23s

PR

Indices listing: ~1.7s
Nodes listing: ~1.7s
ML jobs listing: ~1.7s
ES cluster overview: ~1.7s
Index detail page: ~215ms
Node detail page: ~1.2s
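
For a quick sense of scale, the timings above can be turned into speedup factors. This is plain arithmetic on the approximate numbers listed in this section; the variable and function names are made up for the example.

```javascript
// Approximate timings from the lists above, in milliseconds.
const currentMs = 23000; // every page took ~23s before this PR
const prMs = {
  listingAndOverviewPages: 1700, // indices, nodes, ML jobs, ES cluster overview
  indexDetailPage: 215,
  nodeDetailPage: 1200,
};

// Speedup factor: how many times faster the PR is than the current code.
function speedup(beforeMs, afterMs) {
  return beforeMs / afterMs;
}

for (const [page, ms] of Object.entries(prMs)) {
  console.log(`${page}: ~${speedup(currentMs, ms).toFixed(1)}x faster`);
}
```

That works out to roughly 13x for the listing and overview pages, about 19x for the node detail page, and over 100x for the index detail page.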

Testing

This is a bit tricky. The UI should be unaffected: the API should return the same data the UI needs, so we're just looking to ensure we didn't miss something.

Notes

Some of these test fixes are a result of a bad change here. I think the tests were changed to a point where they weren't even working with real data.

kibanamachine · 2020-01-06T18:38:12Z

💔 Build Failed

continuous-integration/kibana-ci/pull-request
Commit: a30ab44

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

elasticmachine · 2020-01-07T16:29:43Z

Pinging @elastic/stack-monitoring (Team:Monitoring)

chrisronline · 2020-01-07T18:15:51Z

x-pack/test/api_integration/apis/monitoring/elasticsearch/fixtures/index_detail.json

    }
  ],
  "shardStats": {
    "nodes": {
      "jUT5KdxfRbORSCWkb5zjmA": {
-        "shardCount": 38,
-        "indexCount": 20,
+        "shardCount": 5,


I'm pretty sure these are different because the new query now limits the shard data to the specific index instead of spanning all indices.

chrisronline · 2020-01-07T18:16:50Z

x-pack/test/api_integration/apis/monitoring/elasticsearch/node_detail.js

@@ -32,7 +32,7 @@ export default function({ getService }) {
    it('should summarize node with metrics', async () => {
      const { body } = await supertest
        .post(
-          '/api/monitoring/v1/clusters/YCxj-RAgSZCP6GuOQ8M1EQ/elasticsearch/nodes/jxcP6ue7eRCieNNitFTT0EA'
+          '/api/monitoring/v1/clusters/YCxj-RAgSZCP6GuOQ8M1EQ/elasticsearch/nodes/jUT5KdxfRbORSCWkb5zjmA'


I have no idea why this was changed. The original node id doesn't actually exist in the archived data! See #23715

igoristic · 2020-01-09T23:17:48Z

...lugins/monitoring/server/lib/elasticsearch/shards/get_indices_unassigned_shard_stats.test.js

+    const esIndexPattern = '*';
+    const cluster = {};
+    const stats = await getIndicesUnassignedShardStats(req, esIndexPattern, cluster);
+    expect(stats.indices).toEqual(indices);


Is it possible to also test status here? It looks like you already have the right replica/primary counts to test for all three colors 💚 💛 ❤️

It should test for it now. There is a status field here -> https://github.com/elastic/kibana/pull/54028/files/ffdb7d79aa65d7694eb3f2d88d45c16bedfcfc27#diff-3429702abd39406ddf3dc4c1ad63f5a6R12
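
To make the three colors concrete, here is a minimal sketch of how a status could be derived from unassigned primary/replica counts. It follows the usual Elasticsearch health convention (any unassigned primary is red, otherwise any unassigned replica is yellow, otherwise green). The function name and input shape are hypothetical and are not taken from `getIndicesUnassignedShardStats`.

```javascript
// Hypothetical helper: maps unassigned shard counts to a health color.
// The { unassignedPrimary, unassignedReplica } shape is an assumption for this example.
function deriveIndexStatus({ unassignedPrimary = 0, unassignedReplica = 0 }) {
  if (unassignedPrimary > 0) {
    return 'red'; // at least one primary is missing, so some data is unavailable
  }
  if (unassignedReplica > 0) {
    return 'yellow'; // all primaries are assigned, but redundancy is reduced
  }
  return 'green'; // every shard copy is assigned
}

// One case per color, similar in spirit to what the test could assert.
const cases = [
  { counts: { unassignedPrimary: 0, unassignedReplica: 0 }, expected: 'green' },
  { counts: { unassignedPrimary: 0, unassignedReplica: 2 }, expected: 'yellow' },
  { counts: { unassignedPrimary: 1, unassignedReplica: 2 }, expected: 'red' },
];

for (const { counts, expected } of cases) {
  const actual = deriveIndexStatus(counts);
  console.log(`${JSON.stringify(counts)} -> ${actual} (expected ${expected})`);
}
```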

igoristic

This is awesome stuff @chrisronline! 🏆

My benchmarks were a little faster overall, but still within similar margins, maybe because I ran it from Docker (or my computer is faster than yours).

chrisronline · 2020-01-10T16:28:54Z

@elasticmachine merge upstream

kibanamachine · 2020-01-10T17:49:00Z

💚 Build Succeeded

continuous-integration/kibana-ci/pull-request
Commit: 11bea85

History

💚 Build #18234 succeeded ffdb7d7
💔 Build #18208 failed 236ea55

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

* For the nodes listing page, do not fetch shard data for indices
* Optimize our shard queries for the index and node listing pages
* This change isn't necessary
* Rename file and function
* Use optimized query for ml jobs and es overview
* Apply to node/index detail page, and more renaming
* Unnecessary change
* Fix tests
* Add basic tests

Co-authored-by: Elastic Machine <[email protected]>

chrisronline · 2020-01-10T21:07:05Z

Backport:

7.x: d811e4d

* master: (69 commits)
  [Graph] Fix various a11y issues (elastic#54097)
  Add ApplicationService app status management (elastic#50223)
  logs in one time (elastic#54447)
  Deprecate using `elasticsearch.ssl.certificate` without `elasticsearch.ssl.key` and vice versa (elastic#54392)
  [Optimizer] Fix a stack overflow with watch_cache when it attempts to delete very large folders. (elastic#54457)
  Security - Role Mappings UI (elastic#53620)
  [SIEM] [Detection engine] Permission II (elastic#54292)
  Allow User to Cleanup Repository from UI (elastic#53047)
  [Detection engine] Some UX for rule creation (elastic#54471)
  share specific instances of some ui packages (elastic#54079)
  [ML] APM modules configs for RUM Javascript and NodeJS (elastic#53792)
  [APM] Delay rendering invalid license notification (elastic#53924)
  [Graph] Improve error message on graph requests (elastic#54230)
  [ILM] Kibana should allow a min_age setting of 0ms in ILM policy phases (elastic#53719)
  Unit Tests for common/lib (elastic#53736)
  [Graph] Only show explorable fields (elastic#54101)
  remove linting rule exception for markdown (elastic#54232)
  [Monitoring] Fetch shard data more efficiently (elastic#54028)
  [Maps] Add hiddenLayers option to embeddable map input (elastic#54355)
  Pass termOrder and hasTermsAgg properties to serializeThresholdWatch function (elastic#54391)
  ...

chrisronline added 7 commits January 4, 2020 21:52

For the nodes listing page, do not fetch shard data for indices

9dc08c0

Optimize our shard queries for the index and node listing pages

73ded8c

This change isn't necessary

9c62921

Rename file and function

0f0d868

Use optimized query for ml jobs and es overview

b4e8371

Apply to node/index detail page, and more renaming

fe7ff41

Unnecessary change

a30ab44

chrisronline self-assigned this Jan 6, 2020

chrisronline added 3 commits January 6, 2020 15:23

Fix tests

80c93a7

Add basic tests

236ea55

Merge remote-tracking branch 'elastic/master' into monitoring/shard_data_listing_pages

ffdb7d7

chrisronline marked this pull request as ready for review January 7, 2020 16:29

chrisronline requested a review from a team January 7, 2020 16:29

chrisronline added review v7.6.0 v8.0.0 release_note:enhancement Team:Monitoring Stack Monitoring team labels Jan 7, 2020

chrisronline commented Jan 7, 2020

chrisronline requested review from igoristic and removed request for a team January 7, 2020 20:20

igoristic reviewed Jan 9, 2020

igoristic approved these changes Jan 9, 2020

Merge branch 'master' into monitoring/shard_data_listing_pages

11bea85

chrisronline merged commit bf7c253 into elastic:master Jan 10, 2020

chrisronline deleted the monitoring/shard_data_listing_pages branch January 10, 2020 19:06

chrisronline mentioned this pull request Jan 10, 2020

[7.x] [Monitoring] Fetch shard data more efficiently (#54028) #54489

Merged

chrisronline mentioned this pull request Mar 9, 2020

[Monitoring] Pagination fetching shard data on nodes listing #50176

Closed

ycombinator mentioned this pull request Mar 19, 2020

Elasticsearch Stack Monitoring: shards data elastic/beats#17125

Closed
