Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: avoid acting on stale data after launch #10907

Merged
merged 2 commits into from
Jul 20, 2021

Conversation

notnoop
Copy link
Contributor

@notnoop notnoop commented Jul 17, 2021

When the client launches, use a consistent read to fetch its own allocs,
but allow stale read afterwards as long as reads don't revert into older
state.

This change addresses an edge case affecting restarting clients. When a
client restarts, it may fetch a stale data concerning its allocs: allocs
that have completed prior to the client shutdown may still have "run/running"
desired/client status, and have the client attempt to re-run again.

An alternative approach is to track the indices such that the client
set MinQueryIndex on the maximum index the client ever saw, or compare
received allocs against locally restored client state. Garbage
collection complicates this approach (local knowledge is not complete),
and the approach still risks starting "dead" allocations (e.g. the
allocation may have been placed when client just restarted and have
already been rescheduled by the time the client started. This approach
here is effective against all kinds of staleness problems with small
overhead.

Fixes #10901

When the client launches, use a consistent read to fetch its own allocs,
but allow stale read afterwards as long as reads don't revert into older
state.

This change addresses an edge case affecting restarting client. When a
client restarts, it may fetch a stale data concerning its allocs: allocs
that have completed prior to the client shutdown may still have "run/running"
desired/client status, and have the client attempt to re-run again.

An alternative approach is to track the indices such that the client
set MinQueryIndex on the maximum index the client ever saw, or compare
received allocs against locally restored client state. Garbage
collection complicates this approach (local knowledge is not complete),
and the approach still risks starting "dead" allocations (e.g. the
allocation may have been placed when client just restarted and have
already been reschuled by the time the client started. This approach
here is effective against all kinds of stalness problems with small
overhead.
@notnoop notnoop added this to the 1.1.3 milestone Jul 17, 2021
@notnoop notnoop self-assigned this Jul 17, 2021
Copy link
Member

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; nice find!

@notnoop notnoop merged commit 3165ae8 into main Jul 20, 2021
@notnoop notnoop deleted the b-client-allocs-consistency branch July 20, 2021 19:13
@notnoop notnoop mentioned this pull request Jul 28, 2021
notnoop pushed a commit that referenced this pull request Jul 29, 2021
When the client launches, use a consistent read to fetch its own allocs,
but allow stale read afterwards as long as reads don't revert into older
state.

This change addresses an edge case affecting restarting client. When a
client restarts, it may fetch a stale data concerning its allocs: allocs
that have completed prior to the client shutdown may still have "run/running"
desired/client status, and have the client attempt to re-run again.

An alternative approach is to track the indices such that the client
set MinQueryIndex on the maximum index the client ever saw, or compare
received allocs against locally restored client state. Garbage
collection complicates this approach (local knowledge is not complete),
and the approach still risks starting "dead" allocations (e.g. the
allocation may have been placed when client just restarted and have
already been reschuled by the time the client started. This approach
here is effective against all kinds of stalness problems with small
overhead.
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nomad tries to call RecoverTask on an allocation it has GC'd minutes ago; restarts tasks it shouldn't
4 participants