-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved scheduler retry logic under high contention #787
Conversation
I wonder if retry attempts should be randomized, in order to avoid overwhelming the server when too many blocked evaluations are queued. Or does the max retries achieve the same effect? |
} | ||
|
||
e := s.ctx.Eligibility() | ||
classes := e.GetClasses() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May as well not track this if HasEscaped
Minor feedback, LGTM |
Improved scheduler retry logic under high contention
@c4milo the retry limit is there to prevent overwhelming the servers, exactly as you said! |
Nice! shouldn't retries be randomized then? So that in case of any general failure all the queued allocs aren't tried to be scheduled at the same time, DoSing the servers? Or is it unlikely to happen? |
I've seen similar scenarios happening before in other distributed systems, where a service would be unable recover due to all clients retrying at the same time and DoSing/overwhelming the service. |
@c4milo The evaluation broker handles this case. The scheduler limits how many retries it does in a hot loop, before yielding the scheduler thread and moving back into the evaluation broker. There is also randomization in the placement order to reduce contention under extremely high load as well. |
Great! Thanks Armon for explaining further!
|
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
This PR resets the retry count if progress is made during scheduling and fails by creating a blocked eval.
@armon