Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log retries #3589

Merged
merged 1 commit into from
Nov 13, 2024
Merged

Log retries #3589

merged 1 commit into from
Nov 13, 2024

Conversation

MDrakos
Copy link
Member

@MDrakos MDrakos commented Nov 8, 2024

BugDX-3156 Ensure timeouts during polling are retried

I am fairly confident that our buildplanner polling is retrying. The pollBuildPlanned function uses the same underlying client that all buildplanner requests use. This client uses the HTTPTransport from the retry library.

The error message in the story description is also from the retry library: https://github.com/ActiveState/cli/blob/master/internal/retryhttp/client.go#L143. The pollBuildPlanned function has its own a timeout, but produces a different error.

For this PR I've added logging of the number of retries so we can ensure that they did happen when we check the log. We could also add the number of retries to the error message if that would be more helpful.

I've also verified that we retry server-side timeouts. We have a list of status codes that we check in the response to determine if we are going to retry the request and a server timeout is one of them: https://github.com/ActiveState/cli/blob/master/internal/retryhttp/client.go#L59

@github-actions github-actions bot changed the base branch from master to version/0-48-0-RC1 November 8, 2024 21:48
@MDrakos MDrakos marked this pull request as ready for review November 8, 2024 21:55
@MDrakos MDrakos requested a review from Naatan November 8, 2024 21:56
Copy link
Member

@Naatan Naatan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but it sounds like you miss-interpreted this AC:

Please also verify that the above error only happens on client-side timeouts, not server-side timeouts.

That's asking to make sure that we DO NOT show the error in the ticket when the timeout is from the server. ie. that error is only meant for network errors that are caused by the client.

@MDrakos
Copy link
Member Author

MDrakos commented Nov 8, 2024

Looks good, but it sounds like you miss-interpreted this AC:

Please also verify that the above error only happens on client-side timeouts, not server-side timeouts.

That's asking to make sure that we DO NOT show the error in the ticket when the timeout is from the server. ie. that error is only meant for network errors that are caused by the client.

This should be covered by our list of retryable status codes here: https://github.com/ActiveState/cli/blob/master/internal/retryhttp/client.go#L59

When we encounter a 408 status code we will retry, once we've exhausted the retries the error in the story is returned.

@MDrakos MDrakos requested a review from Naatan November 8, 2024 23:53
Copy link
Member

@Naatan Naatan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you missed my meaning.

Please also verify that the above error only happens on client-side timeouts, not server-side timeouts.

By this I am talking about the error that the user receives. I'm not making any statements about retrying. Your response points to the list of retryable http codes, which suggests you determined whether server side timeouts lead to a retry. That's good to know for sure but not what I'm asking about.

@MDrakos
Copy link
Member Author

MDrakos commented Nov 12, 2024

For a server side timeout the error that the users receives will still begin with failed to fetch commit during poll: Encountered unexpected error: however the error message afterwards will be specific to the error code: https://github.com/ActiveState/cli/blob/master/internal/retryhttp/client.go#L117.

The error in the story is a genuine timeout as it implements the net.Error interface where our UserNetworkError type that we return from this client does not.

We can update the buildplanner wrapping to handle the UserNetworkError type different and not wrap with the unexpected response though I'm not sure if that's something we want.

@MDrakos MDrakos requested a review from Naatan November 12, 2024 21:58
@Naatan
Copy link
Member

Naatan commented Nov 12, 2024

For a server side timeout the error that the users receives will still begin with failed to fetch commit during poll: Encountered unexpected error: however the error message afterwards will be specific to the error code: master/internal/retryhttp/client.go#L117.

The error in the story is a genuine timeout as it implements the net.Error interface where our UserNetworkError type that we return from this client does not.

We can update the buildplanner wrapping to handle the UserNetworkError type different and not wrap with the unexpected response though I'm not sure if that's something we want.

No that's fine, the only thing I wanted to make sure of as part of this story is that we do not show the user error for server timeouts. So that we can be sure that when a user gives us that error it's not due to server-side issues. At least as sure as network errors allow us to be.

@MDrakos MDrakos merged commit 2379b9c into version/0-48-0-RC1 Nov 13, 2024
10 checks passed
@MDrakos MDrakos deleted the DX-3156 branch November 13, 2024 00:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants