Block plan application until state store has caught up to raft #5411
Conversation
Force-pushed from 983a68c to 0c45898
Overall logic LGTM; agree with @dadgar's comment about increasing the timeout.
Generalize wait for index logic in the state store for reuse elsewhere. Also begin plumbing in a context to combine handling of timeouts and shutdown.
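For illustration, a minimal sketch of what such a context-driven wait-for-index helper could look like. The stub types, accessor methods, and 50 ms poll interval are assumptions for the sketch, not Nomad's actual implementation:

```go
package state

import (
	"context"
	"time"
)

// StateStore and StateSnapshot stand in for the real types; the accessor
// methods below are placeholders so the sketch compiles.
type StateStore struct{}
type StateSnapshot struct{}

// LatestIndex is assumed to report the highest raft index applied to the
// state store so far (placeholder body).
func (s *StateStore) LatestIndex() (uint64, error) { return 0, nil }

// Snapshot is assumed to return a point-in-time snapshot (placeholder body).
func (s *StateStore) Snapshot() (*StateSnapshot, error) { return &StateSnapshot{}, nil }

// SnapshotAfter blocks until the state store has applied at least the
// given raft index, then returns a snapshot. Timeouts and server shutdown
// both arrive through ctx, so callers have a single cancellation path.
func (s *StateStore) SnapshotAfter(ctx context.Context, index uint64) (*StateSnapshot, error) {
	ticker := time.NewTicker(50 * time.Millisecond) // assumed poll interval
	defer ticker.Stop()

	for {
		latest, err := s.LatestIndex()
		if err != nil {
			return nil, err
		}
		if latest >= index {
			return s.Snapshot()
		}

		select {
		case <-ctx.Done():
			// First pass: surface the context error directly (the
			// follow-up commit below replaces this with a more
			// descriptive error).
			return nil, ctx.Err()
		case <-ticker.C:
		}
	}
}
```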
I don't think it's been used for a long time.
Wait for state store to catch up with raft when applying plans.
Avoid returning context.DeadlineExceeded as it lacks helpful information and is often ignored or handled specially by callers.
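A small sketch of that change; the helper name and message wording are assumptions, and the point is only the contrast with returning ctx.Err() directly:

```go
package state

import (
	"context"
	"fmt"
)

// waitError is a hypothetical helper: once the context is done, return an
// error naming the indices involved instead of the bare
// context.DeadlineExceeded, which tells the reader nothing about what was
// being waited on.
func waitError(ctx context.Context, want, latest uint64) error {
	if ctx.Err() == nil {
		return nil
	}
	return fmt.Errorf("timed out waiting for state store to reach index %d (latest index %d)",
		want, latest)
}
```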
Force-pushed from 0c45898 to a2e4f12
Revert plan_apply.go changes from #5411

Since non-Command Raft messages do not update the StateStore index, SnapshotAfter may unnecessarily block and needlessly fail in idle clusters where the last Raft message is a non-Command message. This is trivially reproducible with the dev agent and a job that has 2 tasks, 1 of which fails. The correct logic would be to SnapshotAfter the previous plan's index to ensure consistency. New clusters or newly elected leaders will not have a previous plan, so the index at which the leader was elected should be used instead.
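A hedged sketch of the index-selection rule described above; prevPlanIndex and leaderElectedIndex are illustrative names rather than real fields:

```go
package state

// snapshotIndex picks the raft index to wait on before applying a plan.
func snapshotIndex(prevPlanIndex, leaderElectedIndex uint64) uint64 {
	// Wait on the previous plan's raft index when one exists; a new
	// cluster or a freshly elected leader has no previous plan, so fall
	// back to the index at which this leader was elected.
	if prevPlanIndex > 0 {
		return prevPlanIndex
	}
	return leaderElectedIndex
}
```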
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Do not merge until 0.9.1
This guards against a situation where planApply could dequeue a plan that references objects that are committed to raft but not yet applied to the leader's state store.

To do this I refactored worker.waitForIndex into StateStore.SnapshotAfter (new name welcome) so it could be shared between the 3 call sites. Backoff logic should be maintained, although error log lines will vary slightly from the old waitForIndex errors ("timeout" vs "deadline exceeded"). A sketch of what a call site could look like after the refactor follows the TODOs heading below.

TODOs
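To make the refactor concrete, a rough call-site sketch reusing the illustrative StateStore types from the earlier block; the 5-second timeout and all names here are assumptions, not the actual plan-apply code:

```go
package state

import (
	"context"
	"fmt"
	"time"
)

// applyPlanSketch shows the rough shape of a call site after the refactor:
// one context carries both the timeout and server shutdown, and the
// returned snapshot is what the plan would be evaluated against.
func applyPlanSketch(shutdownCtx context.Context, store *StateStore, planIndex uint64) (*StateSnapshot, error) {
	ctx, cancel := context.WithTimeout(shutdownCtx, 5*time.Second)
	defer cancel()

	snap, err := store.SnapshotAfter(ctx, planIndex)
	if err != nil {
		return nil, fmt.Errorf("failed to snapshot state store at index %d: %v", planIndex, err)
	}
	return snap, nil
}
```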