Fix diagnostics failures on max message size limit #1777

Merged
michalpristas merged 10 commits into elastic:main on Nov 28, 2022

Conversation

@michalpristas (Contributor) commented on Nov 23, 2022

What does this PR do?

First, it increases the maximum gRPC message size from 4MB to 100MB to support larger single files. No single unit's diagnostics should be larger than this.

Second, it chunks the diagnostics response by unit, changing the response from an array to a stream of per-unit diagnostic results.
This way we're not limited by the total size of the diagnostics, and the number of running units does not affect the limit.
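
For illustration, a rough Go sketch of the two changes; the `cproto` type names, the `server` receiver, and `collectUnitDiagnostics` are placeholders, not the exact generated API from control.proto:

```go
package control

import (
	"google.golang.org/grpc"

	// Placeholder import path; the real generated control protocol stubs live elsewhere.
	cproto "github.com/elastic/elastic-agent/placeholder/cproto"
)

// Raised from gRPC's 4MB default so one unit's diagnostics fit in a single message.
const maxGRPCMsgSize = 100 * 1024 * 1024

// server implements the control protocol; daemon wiring is omitted here.
type server struct {
	cproto.UnimplementedElasticAgentControlServer
}

func newControlServer(impl *server) *grpc.Server {
	srv := grpc.NewServer(
		grpc.MaxRecvMsgSize(maxGRPCMsgSize),
		grpc.MaxSendMsgSize(maxGRPCMsgSize),
	)
	cproto.RegisterElasticAgentControlServer(srv, impl)
	return srv
}

// DiagnosticUnits streams one response per unit instead of returning a single
// array, so the combined size of all units' diagnostics never hits the limit.
func (s *server) DiagnosticUnits(req *cproto.DiagnosticUnitsRequest, stream cproto.ElasticAgentControl_DiagnosticUnitsServer) error {
	for _, unit := range s.collectUnitDiagnostics(req) { // placeholder collector
		if err := stream.Send(unit); err != nil {
			return err
		}
	}
	return nil
}
```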

Why is it important?

diagnostics collect fails on a message body larger than the max limit (4MB): #1715

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Fixes: #1715

@michalpristas added the bug, Team:Elastic-Agent-Control-Plane, Team:Elastic-Agent, backport-skip, and v8.6.0 labels on Nov 23, 2022
@michalpristas requested a review from a team as a code owner on November 23, 2022 08:20
@michalpristas self-assigned this on Nov 23, 2022
@michalpristas requested review from AndersonQ and blakerouse and removed request for a team on November 23, 2022 08:20
@michalpristas marked this pull request as draft on November 23, 2022 08:28
@elasticmachine (Contributor) commented on Nov 23, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-11-24T08:28:34.697+0000

  • Duration: 18 min 44 sec

Test stats 🧪

Test Results: Failed 0, Passed 4517, Skipped 13, Total 4530

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine (Contributor) commented on Nov 23, 2022

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.305% (58/59) 👍
Files 71.066% (140/197) 👍
Classes 70.557% (266/377) 👍
Methods 54.828% (795/1450) 👍
Lines 40.331% (8595/21311) 👎 -0.009
Conditionals 100.0% (0/0) 💚

@michalpristas changed the title from "Fix/grpc limit" to "Fix diagnostics failures on max message size limit" on Nov 23, 2022
@michalpristas marked this pull request as ready for review on November 23, 2022 09:04
control.proto: review thread (resolved)
@cmacknz (Member) left a comment:

Is it possible to add a test that the max message size is actually enforced? Or that the new streaming RPC works correctly at a basic level?

The changes look reasonable but I don't see any tests yet. I think we'd probably want an E2E or integration test to prove that diagnostics collection actually works; that's what would have caught this problem initially.
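
For illustration, one shape such a check could take, sketched assuming a control server serving real unit diagnostics is already running; the address, the `cproto` stub names, and the testify assertions are assumptions, not the PR's actual test code:

```go
package control_test

import (
	"context"
	"testing"

	"github.com/stretchr/testify/require"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/status"

	// Placeholder import path for the generated control protocol stubs.
	cproto "github.com/elastic/elastic-agent/placeholder/cproto"
)

func TestDiagnosticsMaxMessageSize(t *testing.T) {
	// Assumed test address of an already-running control server.
	const addr = "localhost:6789"

	conn, err := grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Artificially tiny client-side limit to exercise the failure path.
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(1024)),
	)
	require.NoError(t, err)
	defer conn.Close()

	client := cproto.NewElasticAgentControlClient(conn)
	stream, err := client.DiagnosticUnits(context.Background(), &cproto.DiagnosticUnitsRequest{})
	require.NoError(t, err)

	// A response larger than the configured limit should surface as
	// ResourceExhausted rather than being silently truncated.
	_, err = stream.Recv()
	require.Equal(t, codes.ResourceExhausted, status.Code(err))
}
```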

@@ -29,27 +30,33 @@ import (
"github.com/elastic/elastic-agent/pkg/core/logger"
)

const (
apiStatusTimeout = 45 * time.Second

Member commented:
Are we sure this is long enough for the entire stream?

It looks like we don't collect CPU profiles by default (maybe we should be doing that here), but if we did and used the default duration, they would each take 30s: https://pkg.go.dev/net/http/pprof#Profile

It might be better to use a timeout for each chunk/message but I'm not sure if this is possible.
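
Not something this PR does, but a per-chunk inactivity timeout is possible on the client side. A rough sketch, again with placeholder `cproto` names:

```go
package client

import (
	"context"
	"errors"
	"io"
	"time"

	// Placeholder import path for the generated control protocol stubs.
	cproto "github.com/elastic/elastic-agent/placeholder/cproto"
)

// collectUnitDiagnostics applies a timeout per received chunk rather than one
// deadline for the whole stream: if no message arrives within perChunk, the
// stream's context is cancelled and the pending Recv returns an error.
func collectUnitDiagnostics(ctx context.Context, client cproto.ElasticAgentControlClient, perChunk time.Duration) ([]*cproto.DiagnosticUnitResponse, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	stream, err := client.DiagnosticUnits(ctx, &cproto.DiagnosticUnitsRequest{})
	if err != nil {
		return nil, err
	}

	var units []*cproto.DiagnosticUnitResponse
	for {
		timer := time.AfterFunc(perChunk, cancel) // arm a per-chunk watchdog
		resp, err := stream.Recv()
		timer.Stop()
		if errors.Is(err, io.EOF) {
			return units, nil
		}
		if err != nil {
			return units, err
		}
		units = append(units, resp)
	}
}
```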


Member replied:

Maybe the 30s default CPU profile duration is the reason we don't collect it by default.


Member replied:

[Blocker]
This is a good point. If diagnostics is set to collect pprof profiles, how could this timeout be configured, or is 45s always enough?


Contributor (author) replied:

I'll get rid of it and let the client drive the cancellation. Seems like a better idea.
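
A minimal sketch of what client-driven cancellation could look like from the CLI side; the `cproto` names and the `writeUnitDiagnostics` helper are placeholders, not the PR's actual code:

```go
package cmd

import (
	"context"
	"errors"
	"io"
	"os"
	"os/signal"

	// Placeholder import path for the generated control protocol stubs.
	cproto "github.com/elastic/elastic-agent/placeholder/cproto"
)

// runDiagnostics lets the CLI own the stream's lifetime: Ctrl-C (or any
// deadline the caller attaches to the context) ends collection instead of a
// hard-coded server-side timeout.
func runDiagnostics(client cproto.ElasticAgentControlClient) error {
	ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt)
	defer cancel()

	stream, err := client.DiagnosticUnits(ctx, &cproto.DiagnosticUnitsRequest{})
	if err != nil {
		return err
	}
	for {
		resp, err := stream.Recv()
		if errors.Is(err, io.EOF) {
			return nil
		}
		if err != nil {
			return err
		}
		writeUnitDiagnostics(resp) // placeholder: add this unit's files to the zip
	}
}
```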

@michalpristas (Contributor, author) commented on Nov 24, 2022

@cmacknz I'd rather add tests to the e2e package so we can also verify the zip has the proper format/information.
Without real units we're not really testing anything, as no diagnostics are generated and sent.
I think mocking this would require more work than writing an e2e test which performs a real/proper test.

@cmacknz (Member) commented on Nov 24, 2022

> @cmacknz I'd rather add tests to the e2e package so we can also verify the zip has the proper format/information.
> Without real units we're not really testing anything, as no diagnostics are generated and sent.
> I think mocking this would require more work than writing an e2e test which performs a real/proper test.

In general I agree an e2e/integration test is the correct test for this. I've created a follow-up issue to do this: #1789.

We can get by with manual testing here to start, even if I don't really like it. I don't want to hold off on fixing this in 8.6 while we invent testing infrastructure for it. At some point we need to pay the price to do this, though, or we'll keep having bugs like this.

@cmacknz (Member) left a comment:

Approving, a bit reluctantly without tests, but I've verified it works manually. I changed the gRPC message size to 1 kB and the command still worked, although very slowly.

@michalpristas merged commit 8aa9b46 into elastic:main on Nov 28, 2022
@blakerouse (Contributor) commented:

This is only changing it for the control protocol between the elastic-agent cmd and the daemon. Do we need to look into making a change for the elastic-agent-client for the spawned components?

@cmacknz (Member) commented on Nov 28, 2022

> Do we need to look into making a change for the elastic-agent-client for the spawned components?

I think we should, to ensure that the possibility of encountering this problem anywhere in the system is eliminated entirely.

Created an issue #1808

We already use a streaming protocol for actions so the gRPC protocol itself likely doesn't need to change there.

@michalpristas added the backport-v8.6.0 label and removed the backport-skip label on Nov 29, 2022
mergify bot pushed a commit that referenced this pull request Nov 29, 2022
Fix diagnostics failures on max message size limit (#1777)

(cherry picked from commit 8aa9b46)
michalpristas added a commit that referenced this pull request Nov 29, 2022
Fix diagnostics failures on max message size limit (#1777)

(cherry picked from commit 8aa9b46)

Co-authored-by: Michal Pristas <[email protected]>
Labels: backport-v8.6.0, bug, skip-changelog, Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane, v8.6.0
Linked issue that merging this pull request may close: Agent diagnostics size can exceed the default gRPC maximum message size (#1715)
5 participants