Fix diagnostics failures on max message size limit #1777
Is it possible to add a test that the max message size is actually enforced? Or that the new streaming RPC works correctly at a basic level?
The changes look reasonable but I don't see any tests yet. I think we'd probably want an E2E or integration test to prove that diagnostics collection actually works, that's what would have caught this problem initially.
@@ -29,27 +30,33 @@ import (
	"github.com/elastic/elastic-agent/pkg/core/logger"
)

const (
	apiStatusTimeout = 45 * time.Second
Are we sure this is long enough for the entire stream?
It looks like we don't collect CPU profiles by default (maybe we should be doing that here), but if we did and used the default duration they would each take 30s https://pkg.go.dev/net/http/pprof#Profile
It might be better to use a timeout for each chunk/message but I'm not sure if this is possible.
Maybe the 30s default CPU profile duration is the reason we don't collect it by default.
[Blocker]
This is a good point. If diagnostics is set to collect pprof profiles, how could this timeout be configured, or is 45s always enough?
I'll get rid of it and let the client drive the cancellation. That seems like a better idea.
@cmacknz I'd rather add tests to the e2e package so we can also verify that the zip has the proper format/information.
In general I agree an e2e/integration test is the correct test for this. I've created a follow-up issue to do this: #1789. We can get by with manual testing here to start, even if I don't really like it. I don't want to hold off on fixing this in 8.6 while we invent testing infrastructure for it. At some point we need to pay the price to do this though, or we'll keep having bugs like this.
Approving, a bit reluctantly without tests but I've verified it works manually. I changed the gRPC message size to 1 kB and the command still worked, although very slowly.
This is only changing it for the control protocol between the |
I think we should ensure the possibility of encountering this problem anywhere in the system is eliminated entirely. Created issue #1808 for this. We already use a streaming protocol for actions, so the gRPC protocol itself likely doesn't need to change there.
Fix diagnostics failures on max message size limit (#1777) (cherry picked from commit 8aa9b46) Co-authored-by: Michal Pristas <[email protected]>
What does this PR do?
First, it increases the message size limit from 4MB to 100MB to support larger single files. No single unit's diagnostics should be larger than this.
Second, it chunks diagnostics responses by unit. The response changes from an array to a stream of Unit Diagnostics results.
This way we're not limited by the total size of the diagnostics, and the number of running units does not affect the limit.
Why is it important?
diagnostics collect fail on message body larger than max limit (4MB) #1715
Checklist
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc
Fixes: #1715