Block plan application until state store has caught up to raft #5411

Merged
schmichael merged 4 commits into master from b-snapshotafter on May 20, 2019

Conversation

@schmichael (Member):

Do not merge until 0.9.1

This guards against a situation where planApply could dequeue a plan that references objects that are committed to raft but not yet applied to the leader's state store.

To do this I refactored worker.waitForIndex into StateStore.SnapshotAfter (new name welcome) so it could be shared among the three call sites. Backoff logic should be maintained, although error log lines will vary slightly from the old waitForIndex errors ("timeout" vs "deadline exceeded").
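
As a rough sketch of that blocking behavior (illustrative only, not the PR's actual SnapshotAfter implementation; indexSource and waitForIndex are hypothetical names), the wait amounts to polling the state store's latest index with capped backoff until it reaches the target, while a single context combines timeout and shutdown handling:

```go
package state

import (
	"context"
	"fmt"
	"time"
)

// indexSource is a stand-in for the state store's applied-index tracking.
type indexSource interface {
	LatestIndex() (uint64, error)
}

// waitForIndex blocks until src reports an index >= index, the context is
// done, or the index lookup fails. Backoff doubles up to a small cap so
// idle waits stay cheap without adding much latency.
func waitForIndex(ctx context.Context, src indexSource, index uint64) error {
	const maxBackoff = 250 * time.Millisecond
	backoff := 5 * time.Millisecond

	for {
		cur, err := src.LatestIndex()
		if err != nil {
			return err
		}
		if cur >= index {
			return nil // fast path: the state store has already caught up
		}

		select {
		case <-ctx.Done():
			// Prefer a descriptive error over returning ctx.Err() directly.
			return fmt.Errorf("timed out waiting for index %d (state store at %d)", index, cur)
		case <-time.After(backoff):
		}

		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```

A caller would derive the context from both a deadline and the server's shutdown signal, so a leader stepping down and a slow apply share one cancellation path.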

TODOs

  • Unit test planApply - not sure how to control raft and state store's synchronization, but I haven't spent much time digging in. Advice welcome.
  • Integration/performance test - This will be done in a follow up PR.
  • Metrics and/or logging around the planApply call site
    • Perhaps metrics should be internal to SnapshotAfter, skipping the fast path that doesn't block? That would give a metric that measures only the lag between the indices and not the non-blocking path (see the sketch after this list).
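
A rough sketch of that metrics idea (timedWait, caughtUp, and wait are illustrative stand-ins, not Nomad identifiers): measure only the path that actually blocks, so the emitted number reflects real index lag.

```go
package state

import (
	"context"
	"log"
	"time"
)

// timedWait reports how long the blocking wait took and records nothing
// when the fast-path check says the state store is already caught up.
func timedWait(ctx context.Context, caughtUp func() bool, wait func(context.Context) error) error {
	if caughtUp() {
		return nil // fast path: no measurement, no log noise
	}
	start := time.Now()
	err := wait(ctx)
	// A real implementation would emit a metric here instead of a log line.
	log.Printf("state store lagged raft for %s", time.Since(start))
	return err
}
```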

@schmichael schmichael requested review from preetapan and dadgar March 12, 2019 21:52
@schmichael schmichael marked this pull request as ready for review May 8, 2019 21:38
Review threads: nomad/plan_apply.go (outdated, resolved); nomad/worker.go (resolved)
@preetapan (Contributor) left a comment:

Overall logic lgtm, agree with @dadgar's comment about increasing the timeout

The four commit messages:

  • Generalize wait for index logic in the state store for reuse elsewhere. Also begin plumbing in a context to combine handling of timeouts and shutdown.
  • I don't think it's been used for a long time.
  • Wait for state store to catch up with raft when applying plans.
  • Avoid returning context.DeadlineExceeded as it lacks helpful information and is often ignored or handled specially by callers (see the sketch below).
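
As a minimal illustration of that last commit's intent (describeWaitErr and its signature are assumptions for this sketch, not code from the PR), the bare context error can be translated into one that names the indices involved:

```go
package state

import (
	"context"
	"errors"
	"fmt"
)

// describeWaitErr converts a bare context error into one that tells the
// caller which index was wanted and how far the state store had gotten.
func describeWaitErr(err error, want, have uint64) error {
	if errors.Is(err, context.DeadlineExceeded) {
		return fmt.Errorf("timed out waiting for state store index %d (currently at %d)", want, have)
	}
	return err
}
```
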
@schmichael schmichael requested a review from dadgar May 17, 2019 22:43
@schmichael schmichael merged commit 59946ff into master May 20, 2019
@schmichael schmichael deleted the b-snapshotafter branch May 20, 2019 21:03
schmichael added a commit that referenced this pull request Jun 3, 2019
Revert plan_apply.go changes from #5411

Since non-Command Raft messages do not update the StateStore index,
SnapshotAfter may unnecessarily block and needlessly fail in idle
clusters where the last Raft message is a non-Command message.

This is trivially reproducible with the dev agent and a job that has 2
tasks, 1 of which fails.

The correct logic would be to SnapshotAfter the previous plan's index to
ensure consistency. New clusters or newly elected leaders will not have
a previous plan, so the index at which the leader was elected should be
used instead.
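
A hypothetical sketch of that corrected wait-index choice (planWaitIndex, prevPlanIndex, and leaderElectedIndex are illustrative names, not identifiers from the follow-up change):

```go
package state

// planWaitIndex picks the index a plan applier should wait for: the
// previous plan's index when one exists, otherwise the index at which
// the current leader was elected (new cluster or freshly elected leader).
func planWaitIndex(prevPlanIndex, leaderElectedIndex uint64) uint64 {
	if prevPlanIndex == 0 {
		return leaderElectedIndex
	}
	return prevPlanIndex
}
```

The chosen index would then feed the SnapshotAfter-style wait before the plan is evaluated.
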
github-actions bot commented Feb 9, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 9, 2023