[Hotfix] Increase the timeout of the ProxyActor health check #2082
Why are these changes needed?
I observed that NumServeEndpoints changes frequently, especially after we started to watch Endpoints in #2080. The error message is:

The timeout of the HTTP client is 20 ms. Hence, I increased the timeout to 2 seconds, which is the same as the dashboard HTTP client.
CheckProxyActorHealth does not fail during my 30-minute experiment. See this gist for more details.
CheckHealth fails 6 times in my 30-minute experiment. See this gist for more details.

I marked it as 'Hotfix' because I think 20 ms should be enough for my very simple setup (single Ray node, local Kind cluster, no requests). Hence, the instability may be a Ray Serve issue.
Related issue number
Checks