Block plan application until state store has caught up to raft #5411
Conversation
Force-pushed from 983a68c to 0c45898
Overall logic LGTM; agree with @dadgar's comment about increasing the timeout.
Generalize wait for index logic in the state store for reuse elsewhere. Also begin plumbing in a context to combine handling of timeouts and shutdown.
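For illustration, a minimal sketch of what such a context-driven wait-for-index helper could look like. The stub types, accessor methods, and 50 ms poll interval are assumptions for the sketch, not Nomad's actual implementation:

```go
package state

import (
	"context"
	"time"
)

// StateStore and StateSnapshot stand in for the real types; the accessor
// methods below are placeholders so the sketch compiles.
type StateStore struct{}
type StateSnapshot struct{}

// LatestIndex is assumed to report the highest raft index applied to the
// state store so far (placeholder body).
func (s *StateStore) LatestIndex() (uint64, error) { return 0, nil }

// Snapshot is assumed to return a point-in-time snapshot (placeholder body).
func (s *StateStore) Snapshot() (*StateSnapshot, error) { return &StateSnapshot{}, nil }

// SnapshotAfter blocks until the state store has applied at least the
// given raft index, then returns a snapshot. Timeouts and server shutdown
// both arrive through ctx, so callers have a single cancellation path.
func (s *StateStore) SnapshotAfter(ctx context.Context, index uint64) (*StateSnapshot, error) {
	ticker := time.NewTicker(50 * time.Millisecond) // assumed poll interval
	defer ticker.Stop()

	for {
		latest, err := s.LatestIndex()
		if err != nil {
			return nil, err
		}
		if latest >= index {
			return s.Snapshot()
		}

		select {
		case <-ctx.Done():
			// First pass: surface the context error directly (the
			// follow-up commit below replaces this with a more
			// descriptive error).
			return nil, ctx.Err()
		case <-ticker.C:
		}
	}
}
```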
I don't think it's been used for a long time.
Wait for state store to catch up with raft when applying plans.
Avoid returning context.DeadlineExceeded as it lacks helpful information and is often ignored or handled specially by callers.
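A small sketch of that change; the helper name and message wording are assumptions, and the point is only the contrast with returning ctx.Err() directly:

```go
package state

import (
	"context"
	"fmt"
)

// waitError is a hypothetical helper: once the context is done, return an
// error naming the indices involved instead of the bare
// context.DeadlineExceeded, which tells the reader nothing about what was
// being waited on.
func waitError(ctx context.Context, want, latest uint64) error {
	if ctx.Err() == nil {
		return nil
	}
	return fmt.Errorf("timed out waiting for state store to reach index %d (latest index %d)",
		want, latest)
}
```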
Force-pushed from 0c45898 to a2e4f12
Revert plan_apply.go changes from #5411

Since non-Command Raft messages do not update the StateStore index, SnapshotAfter may unnecessarily block and needlessly fail in idle clusters where the last Raft message is a non-Command message. This is trivially reproducible with the dev agent and a job that has 2 tasks, 1 of which fails. The correct logic would be to SnapshotAfter the previous plan's index to ensure consistency. New clusters or newly elected leaders will not have a previous plan, so the index at which the leader was elected should be used instead.
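A hedged sketch of the index-selection rule described above; prevPlanIndex and leaderElectedIndex are illustrative names rather than real fields:

```go
package state

// snapshotIndex picks the raft index to wait on before applying a plan.
func snapshotIndex(prevPlanIndex, leaderElectedIndex uint64) uint64 {
	// Wait on the previous plan's raft index when one exists; a new
	// cluster or a freshly elected leader has no previous plan, so fall
	// back to the index at which this leader was elected.
	if prevPlanIndex > 0 {
		return prevPlanIndex
	}
	return leaderElectedIndex
}
```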
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Do not merge until 0.9.1
This guards against a situation where planApply could dequeue a plan that references objects that are committed to raft but not yet applied to the leader's state store.

To do this I refactored worker.waitForIndex into StateStore.SnapshotAfter (new name welcome) so it could be shared between the 3 call sites. Backoff logic should be maintained, although error log lines will vary slightly from the old waitForIndex errors ("timeout" vs "deadline exceeded"). A sketch of what a call site could look like after the refactor follows the TODOs heading below.

TODOs
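To make the refactor concrete, a rough call-site sketch reusing the illustrative StateStore types from the earlier block; the 5-second timeout and all names here are assumptions, not the actual plan-apply code:

```go
package state

import (
	"context"
	"fmt"
	"time"
)

// applyPlanSketch shows the rough shape of a call site after the refactor:
// one context carries both the timeout and server shutdown, and the
// returned snapshot is what the plan would be evaluated against.
func applyPlanSketch(shutdownCtx context.Context, store *StateStore, planIndex uint64) (*StateSnapshot, error) {
	ctx, cancel := context.WithTimeout(shutdownCtx, 5*time.Second)
	defer cancel()

	snap, err := store.SnapshotAfter(ctx, planIndex)
	if err != nil {
		return nil, fmt.Errorf("failed to snapshot state store at index %d: %v", planIndex, err)
	}
	return snap, nil
}
```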