Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report uploads now use timeouts and retriees with backoff #34

Merged
merged 2 commits into from
Aug 1, 2024

Conversation

quintesse
Copy link
Contributor

Fixes mwtele 225 and 226

NB: This needs the very latest version of the client to work (at least 2.0.3-SNAPSHOT) so this code won't work properly until either 2.0.3 has been released or you manually build the latest client and make the agent use it.

Copy link
Collaborator

@kittylyst kittylyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want QE to weigh in on this one before I approve - does it actually address the issue?

As it stands, 413 and 415 (which are client-side more-or-less-unrecoverable errors) are treated the same way as 503 (the code which was the original point of the issue).

FWIW, the spec says this about 413:

"If the condition is temporary, the server SHOULD generate a Retry-After header field to indicate that it is temporary and after what time the client MAY try again."

We emphatically do not check that header, even assuming it's created (and propagated back to us)

For 503:

"The server MAY send a Retry-After header field to suggest an appropriate amount of time for the client to wait before retrying the request."

Given the normative language, my reading of it is that neither 413 or 415 should be treated as transient conditions, but arguably 503 should be (and that given the lack of guidance from the server, our backoff strategy is acceptable) - therefore they should be handled differently. As it stands, this PR does not do that.

@quintesse
Copy link
Contributor Author

Indeed, this PR simply copies the similar behavior that currently exists in the Insights client code. Treating different errors differently from the PoV of the BackoffWrapper code wouldn't be too hard, but isn't currently implemented. We would have to define how to distinguish the errors that are "repeatable" from the ones that aren't, though.

@mocenas
Copy link
Contributor

mocenas commented Aug 1, 2024

@kittylyst
This thing does address the problem with no retry. Original motivation was to retry in case that server is overloaded or has whatever TEMPORARY problem. Since http client cannot really know if server side problem is temporary or not, it makes sense to me to retry for all 5xx codes.

I agree that it would make sense to not backoff in case of not-recoverable errors, basically for all 4xx codes. If you want to extend BackoffWrapper to distinguish recoverable and non-recoverable errors. That is IMHO OK. As @quintesse stated, this PR copies the same behaviour as used in client-runtime. So it should also be addressed there.
Also this PR should handle 403 code.

As I already stated somewhere, this PR IMHO has too long timeout (1 minute) and too many retries (5 retries exponentially backing off) which combined can take ~6-7 minutes to timeout, in extreme case. That is IMHO too much.

To conclude, this PR solves the original problems, but IMHO can be tweaked to do it better.

@kittylyst
Copy link
Collaborator

@mocenas I understand where you're coming from, but I think I'm "agree to disagree" about this. 😺

However, I think we're all in agreement that this is an improvement, and we are under some time pressure for this.

So, :lgtm: and we can tweak this in a future release.

Copy link
Collaborator

@kittylyst kittylyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kittylyst kittylyst merged commit fb3b09f into RedHatInsights:main Aug 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants