nomad: include snapshot index when submitting plans #5791
Conversation
Plans must also be committed serially, so Plan N+1 should use a state snapshot containing Plan N. This is guaranteed for plans *after* the first plan after a leader election.
Curious how that is guaranteed now. If two workers submitted independent non-conflicting plans with the same snapshot index N-1, the planner would apply the first plan and commit it at index N; where does it wait until N is committed and applied to the state store before evaluating the second plan?
Force-pushed from 6f66fd3 to 31ea934
Plan application should use a state snapshot at or after the Raft index at which the plan was created; otherwise it risks being rejected based on stale data. This commit adds Plan.SnapshotIndex, which is set by workers when submitting a plan. SnapshotIndex is set to the Raft index of the snapshot the worker used to generate the plan.

Plan.SnapshotIndex plays a similar role to PlanResult.RefreshIndex. While RefreshIndex informs workers their StateStore is behind the leader's, SnapshotIndex is a way to prevent the leader from using a StateStore behind the worker's. Plan.SnapshotIndex should be considered the *lower bound* index for consistently handling plan application.

Plans must also be committed serially, so Plan N+1 should use a state snapshot containing Plan N. This is guaranteed for plans *after* the first plan after a leader election. The Raft barrier on leader election ensures the leader's StateStore has caught up to the log index at which it was elected, which guarantees its StateStore is at an index > lastPlanIndex.
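As a rough sketch of the relationship described above (the real structs in nomad/structs carry many more fields; this is illustrative only):

```go
// Plan is a worker-generated scheduling plan (heavily simplified sketch).
type Plan struct {
	// SnapshotIndex is the Raft index of the state snapshot the worker
	// used to generate this plan; the leader must not evaluate the plan
	// against a snapshot older than this index.
	SnapshotIndex uint64
	// ... node allocations, evictions, etc.
}

// PlanResult is the leader's response to a submitted plan (sketch).
type PlanResult struct {
	// RefreshIndex tells the worker its state store is behind the
	// leader's and must refresh to at least this index before retrying.
	RefreshIndex uint64
	// ... applied allocations, etc.
}
```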
The previous commit prevented evaluating plans against a state snapshot which is older than the snapshot at which the plan was created. This is correct and prevents failures trying to retrieve referenced objects that may not exist until the plan's snapshot. However, this is insufficient to guarantee consistency if the following events occur:

1. P1, P2, and P3 are enqueued with snapshot @ 100
2. Leader evaluates and applies Plan P1 with snapshot @ 100
3. Leader evaluates Plan P2 with snapshot+P1 @ 100
4. P1 commits @ 101
5. Leader evaluates and applies Plan P3 with snapshot+P2 @ 100

Since only the previous plan is optimistically applied to the state store, the snapshot used to evaluate a plan may not contain the N-2 plan! To ensure plans are evaluated and applied serially, we must consider all previous plans' committed indexes when evaluating further plans. Therefore, combined with the last PR, the minimum index at which to evaluate a plan is:

max(previousPlanResultIndex, plan.SnapshotIndex)
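In code, that rule is just a max over the two lower bounds. A minimal sketch (helper name hypothetical, not from the PR):

```go
// minEvalIndex returns the lowest state store index at which it is safe
// to evaluate a plan: the snapshot must contain both the previous plan's
// committed results and everything the worker saw when creating the plan.
func minEvalIndex(previousPlanResultIndex, planSnapshotIndex uint64) uint64 {
	if previousPlanResultIndex > planSnapshotIndex {
		return previousPlanResultIndex
	}
	return planSnapshotIndex
}
```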
Rename SnapshotAfter to SnapshotMinIndex. The old name was not technically accurate. SnapshotAtOrAfter is more accurate, but wordy and still lacks context about what precisely it is at or after (the index). SnapshotMinIndex was chosen as it describes the action (snapshot), a constraint (minimum), and the object of the constraint (index).
Force-pushed from 31ea934 to bcad687
Overall looks good to me, but I would like to do one more round of review if no one else does.
```go
const timeout = 5 * time.Second
ctx, cancel := context.WithTimeout(context.Background(), timeout)
snap, err := p.fsm.State().SnapshotMinIndex(ctx, minIndex)
cancel()
```
Should we use a defer for the cancel invocation? It's a little more idiomatic and is possible now that you've extracted the function.
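For illustration, the suggestion might look like this once the call sits in its own function (receiver and return types are guesses based on the diff above, not the PR's actual code):

```go
// snapshotMinIndex wraps SnapshotMinIndex with a bounded wait; defer now
// works cleanly because the context's lifetime matches the function's.
func (p *planner) snapshotMinIndex(minIndex uint64) (*state.StateSnapshot, error) {
	const timeout = 5 * time.Second
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return p.fsm.State().SnapshotMinIndex(ctx, minIndex)
}
```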
If 3 plans (P1, P2, and P3) are all submitted with the same priority and snapshot index, the following could occur (note that the order of the plans is arbitrary; concurrent workers could have submitted them in any order).
Leader now evaluates P3 against a state containing both P1 (from index 101) and P2 (from the optimistic insert), and the logic proceeds. Even if SnapshotIndex jumps wildly between future and past values (relative to the leader's state index) depending on queue depth, plan priority, and/or worker speed, the optimistic insert and waiting on the plan commit index ensure plans are evaluated against a state snapshot containing all previous plans.
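A compressed sketch of that serialization, reusing the helpers sketched earlier (names and types hypothetical; the real plan applier is more involved):

```go
// planLoop evaluates pending plans one at a time. Before each evaluation
// it waits for the leader's state store to reach both the previous
// plan's commit index and the plan's own snapshot index, so every plan
// is evaluated against a snapshot containing all plans before it.
func (p *planner) planLoop(pending <-chan *Plan) {
	var prevPlanResultIndex uint64
	for plan := range pending {
		minIndex := minEvalIndex(prevPlanResultIndex, plan.SnapshotIndex)
		snap, err := p.snapshotMinIndex(minIndex) // blocks until state >= minIndex
		if err != nil {
			// Invariant cannot be met in time: fail the plan and let the
			// worker reprocess the evaluation.
			continue
		}
		// Hypothetical: evaluatePlan checks snap for conflicts and
		// optimistically applies the result so the next evaluation can
		// see it even before Raft commits it.
		result := p.evaluatePlan(snap, plan)
		prevPlanResultIndex = result.AllocIndex
	}
}
```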
lgtm - thank you for following through on this. The code comments explain the logic here so well, thanks!
Old description of `{plan,worker}.wait_for_index` described the metric in terms of waiting for a snapshot, which has two problems:

1. "Snapshot" is an overloaded term in Nomad, and operators can't be expected to know which use we're referring to here.
2. The most important thing about the metric is what we're waiting *on* before taking a snapshot: the Raft index of the object to be processed (plan or eval).

The new description tries to cram all of that context into the tiny space provided. See #5791 for details about the `wait_for_index` mechanism in general.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
tl;dr

Ensure the leader's state snapshot is at or after

max(previousPlanResultIndex, plan.SnapshotIndex)

when evaluating and applying a plan.

Background
Workers receive evals, create plans, and submit them to the leader for evaluation and application. This scheduling pipeline runs concurrently with Raft over RPC between server agents to provide optimistic parallelism. The leader is responsible for evaluating workers' plans by ensuring they do not conflict and then applying them. Since this entire process is concurrent with Raft consensus, certain invariants must be met by the leader to ensure plans are evaluated and applied serially:

1. The state snapshot used to evaluate a plan must be at or after the Raft index at which the plan was created (plan.SnapshotIndex).
2. The state snapshot used to evaluate a plan must contain the results of all previously applied plans (previousPlanResultIndex).
Implementation
The first invariant is enforced by ensuring the state snapshot used to evaluate plans is >= the snapshot at which the plan was created. (new in 1d670f2)
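One way such a guard can work, as a self-contained sketch (Nomad's actual SnapshotMinIndex lives in the state store and may use change notifications rather than polling; this only illustrates the semantics):

```go
// waitForIndex polls a monotonically increasing index until it reaches
// minIndex or ctx expires. A snapshot taken after it returns nil is
// guaranteed to be at or after minIndex.
func waitForIndex(ctx context.Context, latestIndex func() uint64, minIndex uint64) error {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		if latestIndex() >= minIndex {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // timed out; the plan fails and is retried
		case <-ticker.C:
		}
	}
}
```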
The second invariant is enforced through 3 mechanisms:

1. The leader evaluates plans serially, one at a time.
2. Each plan's results are optimistically applied to the state snapshot used to evaluate the next plan.
3. Before evaluating a plan, the leader waits until its state store has caught up to the previous plan's committed result index.
If either invariant cannot be met, the plan fails and the worker must reprocess the evaluation.
Previous Attempt
The previous attempt in #5411 tried to enforce these invariants by ensuring the leader's state had caught up to the latest Raft index. It contained some implementation bugs, but the concept was a valid way to enforce the invariants. However, it suffered from a couple of issues: