Fix diagnostics failures on max message size limit #1777

Merged
michalpristas merged 10 commits into elastic:main on Nov 28, 2022

Conversation

@michalpristas (Contributor) commented on Nov 23, 2022

What does this PR do?

First, it increases the maximum gRPC message size from 4MB to 100MB to support larger single files. No single unit's diagnostics should be larger than this.

Second, it chunks the diagnostics response by unit, changing the response from an array to a stream of per-unit diagnostic results.
This way we're not limited by the total size of the diagnostics, and the number of running units does not affect the limit.
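
For illustration, a rough Go sketch of the two changes; the `cproto` type names, the `server` receiver, and `collectUnitDiagnostics` are placeholders, not the exact generated API from control.proto:

```go
package control

import (
	"google.golang.org/grpc"

	// Placeholder import path; the real generated control protocol stubs live elsewhere.
	cproto "github.com/elastic/elastic-agent/placeholder/cproto"
)

// Raised from gRPC's 4MB default so one unit's diagnostics fit in a single message.
const maxGRPCMsgSize = 100 * 1024 * 1024

// server implements the control protocol; daemon wiring is omitted here.
type server struct {
	cproto.UnimplementedElasticAgentControlServer
}

func newControlServer(impl *server) *grpc.Server {
	srv := grpc.NewServer(
		grpc.MaxRecvMsgSize(maxGRPCMsgSize),
		grpc.MaxSendMsgSize(maxGRPCMsgSize),
	)
	cproto.RegisterElasticAgentControlServer(srv, impl)
	return srv
}

// DiagnosticUnits streams one response per unit instead of returning a single
// array, so the combined size of all units' diagnostics never hits the limit.
func (s *server) DiagnosticUnits(req *cproto.DiagnosticUnitsRequest, stream cproto.ElasticAgentControl_DiagnosticUnitsServer) error {
	for _, unit := range s.collectUnitDiagnostics(req) { // placeholder collector
		if err := stream.Send(unit); err != nil {
			return err
		}
	}
	return nil
}
```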

Why is it important?

diagnostics collect fails on a message body larger than the max limit (4MB): #1715

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Fixes: #1715

@michalpristas added the bug, Team:Elastic-Agent-Control-Plane, Team:Elastic-Agent, backport-skip, and v8.6.0 labels on Nov 23, 2022
@michalpristas requested a review from a team as a code owner on November 23, 2022 08:20
@michalpristas self-assigned this on Nov 23, 2022
@michalpristas requested review from AndersonQ and blakerouse and removed request for a team on November 23, 2022 08:20
@michalpristas marked this pull request as draft on November 23, 2022 08:28
@elasticmachine (Contributor) commented on Nov 23, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-11-24T08:28:34.697+0000

  • Duration: 18 min 44 sec

Test stats 🧪

Test Results: Failed 0, Passed 4517, Skipped 13, Total 4530

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine (Contributor) commented on Nov 23, 2022

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.305% (58/59) 👍
Files 71.066% (140/197) 👍
Classes 70.557% (266/377) 👍
Methods 54.828% (795/1450) 👍
Lines 40.331% (8595/21311) 👎 -0.009
Conditionals 100.0% (0/0) 💚

@michalpristas changed the title from "Fix/grpc limit" to "Fix diagnostics failures on max message size limit" on Nov 23, 2022
@michalpristas marked this pull request as ready for review on November 23, 2022 09:04
control.proto: review thread (resolved)
@cmacknz (Member) left a comment:

Is it possible to add a test that the max message size is actually enforced? Or that the new streaming RPC works correctly at a basic level?

The changes look reasonable but I don't see any tests yet. I think we'd probably want an E2E or integration test to prove that diagnostics collection actually works; that's what would have caught this problem initially.
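
For illustration, one shape such a check could take, sketched assuming a control server serving real unit diagnostics is already running; the address, the `cproto` stub names, and the testify assertions are assumptions, not the PR's actual test code:

```go
package control_test

import (
	"context"
	"testing"

	"github.com/stretchr/testify/require"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/status"

	// Placeholder import path for the generated control protocol stubs.
	cproto "github.com/elastic/elastic-agent/placeholder/cproto"
)

func TestDiagnosticsMaxMessageSize(t *testing.T) {
	// Assumed test address of an already-running control server.
	const addr = "localhost:6789"

	conn, err := grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Artificially tiny client-side limit to exercise the failure path.
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(1024)),
	)
	require.NoError(t, err)
	defer conn.Close()

	client := cproto.NewElasticAgentControlClient(conn)
	stream, err := client.DiagnosticUnits(context.Background(), &cproto.DiagnosticUnitsRequest{})
	require.NoError(t, err)

	// A response larger than the configured limit should surface as
	// ResourceExhausted rather than being silently truncated.
	_, err = stream.Recv()
	require.Equal(t, codes.ResourceExhausted, status.Code(err))
}
```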

@@ -29,27 +30,33 @@ import (
"github.com/elastic/elastic-agent/pkg/core/logger"
)

const (
apiStatusTimeout = 45 * time.Second

Member commented:
Are we sure this is long enough for the entire stream?

It looks like we don't collect CPU profiles by default (maybe we should be doing that here), but if we did and used the default duration, they would each take 30s: https://pkg.go.dev/net/http/pprof#Profile

It might be better to use a timeout for each chunk/message but I'm not sure if this is possible.
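
Not something this PR does, but a per-chunk inactivity timeout is possible on the client side. A rough sketch, again with placeholder `cproto` names:

```go
package client

import (
	"context"
	"errors"
	"io"
	"time"

	// Placeholder import path for the generated control protocol stubs.
	cproto "github.com/elastic/elastic-agent/placeholder/cproto"
)

// collectUnitDiagnostics applies a timeout per received chunk rather than one
// deadline for the whole stream: if no message arrives within perChunk, the
// stream's context is cancelled and the pending Recv returns an error.
func collectUnitDiagnostics(ctx context.Context, client cproto.ElasticAgentControlClient, perChunk time.Duration) ([]*cproto.DiagnosticUnitResponse, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	stream, err := client.DiagnosticUnits(ctx, &cproto.DiagnosticUnitsRequest{})
	if err != nil {
		return nil, err
	}

	var units []*cproto.DiagnosticUnitResponse
	for {
		timer := time.AfterFunc(perChunk, cancel) // arm a per-chunk watchdog
		resp, err := stream.Recv()
		timer.Stop()
		if errors.Is(err, io.EOF) {
			return units, nil
		}
		if err != nil {
			return units, err
		}
		units = append(units, resp)
	}
}
```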


Member replied:

Maybe the 30s default CPU profile duration is the reason we don't collect it by default.


Member replied:

[Blocker]
This is a good point. If diagnostics is set to collect pprof profiles, how could this timeout be configured, or is 45s always enough?


Contributor (author) replied:

I'll get rid of it and let the client drive the cancellation. Seems like a better idea.
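
A minimal sketch of what client-driven cancellation could look like from the CLI side; the `cproto` names and the `writeUnitDiagnostics` helper are placeholders, not the PR's actual code:

```go
package cmd

import (
	"context"
	"errors"
	"io"
	"os"
	"os/signal"

	// Placeholder import path for the generated control protocol stubs.
	cproto "github.com/elastic/elastic-agent/placeholder/cproto"
)

// runDiagnostics lets the CLI own the stream's lifetime: Ctrl-C (or any
// deadline the caller attaches to the context) ends collection instead of a
// hard-coded server-side timeout.
func runDiagnostics(client cproto.ElasticAgentControlClient) error {
	ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt)
	defer cancel()

	stream, err := client.DiagnosticUnits(ctx, &cproto.DiagnosticUnitsRequest{})
	if err != nil {
		return err
	}
	for {
		resp, err := stream.Recv()
		if errors.Is(err, io.EOF) {
			return nil
		}
		if err != nil {
			return err
		}
		writeUnitDiagnostics(resp) // placeholder: add this unit's files to the zip
	}
}
```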

@michalpristas (Contributor, author) commented on Nov 24, 2022

@cmacknz I'd rather add tests to the e2e package so we can also verify the zip has the proper format/information.
Without real units we're not really testing anything, as no diagnostics are generated and sent.
I think mocking this would require more work than writing an e2e test which performs a real/proper test.

@cmacknz (Member) commented on Nov 24, 2022

> @cmacknz I'd rather add tests to the e2e package so we can also verify the zip has the proper format/information.
> Without real units we're not really testing anything, as no diagnostics are generated and sent.
> I think mocking this would require more work than writing an e2e test which performs a real/proper test.

In general I agree an e2e/integration test is the correct test for this. I've created a follow-up issue to do this: #1789.

We can get by with manual testing here to start, even if I don't really like it. I don't want to hold off on fixing this in 8.6 while we invent testing infrastructure for it. At some point we need to pay the price to do this, though, or we'll keep having bugs like this.

@cmacknz (Member) left a comment:

Approving, a bit reluctantly without tests, but I've verified it works manually. I changed the gRPC message size to 1 kB and the command still worked, although very slowly.

@michalpristas merged commit 8aa9b46 into elastic:main on Nov 28, 2022
@blakerouse (Contributor) commented:

This is only changing it for the control protocol between the elastic-agent cmd and the daemon. Do we need to look into making a change for the elastic-agent-client for the spawned components?

@cmacknz (Member) commented on Nov 28, 2022

> Do we need to look into making a change for the elastic-agent-client for the spawned components?

I think we should, to ensure that the possibility of encountering this problem anywhere in the system is eliminated entirely.

Created an issue #1808

We already use a streaming protocol for actions so the gRPC protocol itself likely doesn't need to change there.

@michalpristas added the backport-v8.6.0 label and removed the backport-skip label on Nov 29, 2022
mergify bot pushed a commit that referenced this pull request Nov 29, 2022
Fix diagnostics failures on max message size limit (#1777)

(cherry picked from commit 8aa9b46)
michalpristas added a commit that referenced this pull request Nov 29, 2022
Fix diagnostics failures on max message size limit (#1777)

(cherry picked from commit 8aa9b46)

Co-authored-by: Michal Pristas <[email protected]>
Labels: backport-v8.6.0, bug, skip-changelog, Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane, v8.6.0
Linked issue that merging this pull request may close: Agent diagnostics size can exceed the default gRPC maximum message size (#1715)
5 participants