-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Report uploads now use timeouts and retriees with backoff #34
Conversation
Fixes MWTELE-226
Fixes MWTELE-225
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want QE to weigh in on this one before I approve - does it actually address the issue?
As it stands, 413 and 415 (which are client-side more-or-less-unrecoverable errors) are treated the same way as 503 (the code which was the original point of the issue).
FWIW, the spec says this about 413:
"If the condition is temporary, the server SHOULD generate a Retry-After header field to indicate that it is temporary and after what time the client MAY try again."
We emphatically do not check that header, even assuming it's created (and propagated back to us)
For 503:
"The server MAY send a Retry-After header field to suggest an appropriate amount of time for the client to wait before retrying the request."
Given the normative language, my reading of it is that neither 413 or 415 should be treated as transient conditions, but arguably 503 should be (and that given the lack of guidance from the server, our backoff strategy is acceptable) - therefore they should be handled differently. As it stands, this PR does not do that.
Indeed, this PR simply copies the similar behavior that currently exists in the Insights client code. Treating different errors differently from the PoV of the BackoffWrapper code wouldn't be too hard, but isn't currently implemented. We would have to define how to distinguish the errors that are "repeatable" from the ones that aren't, though. |
@kittylyst I agree that it would make sense to not backoff in case of not-recoverable errors, basically for all 4xx codes. If you want to extend BackoffWrapper to distinguish recoverable and non-recoverable errors. That is IMHO OK. As @quintesse stated, this PR copies the same behaviour as used in client-runtime. So it should also be addressed there. As I already stated somewhere, this PR IMHO has too long timeout (1 minute) and too many retries (5 retries exponentially backing off) which combined can take ~6-7 minutes to timeout, in extreme case. That is IMHO too much. To conclude, this PR solves the original problems, but IMHO can be tweaked to do it better. |
@mocenas I understand where you're coming from, but I think I'm "agree to disagree" about this. 😺 However, I think we're all in agreement that this is an improvement, and we are under some time pressure for this. So, :lgtm: and we can tweak this in a future release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Fixes mwtele 225 and 226
NB: This needs the very latest version of the client to work (at least 2.0.3-SNAPSHOT) so this code won't work properly until either 2.0.3 has been released or you manually build the latest client and make the agent use it.