Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ephemeral disk: migrate should imply sticky #16826

Merged
merged 2 commits into from
Apr 7, 2023
Merged

Conversation

tgross
Copy link
Member

@tgross tgross commented Apr 7, 2023

The ephemeral_disk block's migrate field allows for best-effort migration of the ephemeral disk data to new nodes. The documentation says the migrate field is only respected if sticky=true, but in fact if client ACLs are not set the data is migrated even if sticky=false.

The existing behavior when client ACLs are disabled has existed since the early implementation, so "fixing" that case now would silently break backwards compatibility. Additionally, having migrate not imply sticky seems nonsensical: it suggests that if we place on a new node we migrate the data but if we place on the same node, we throw the data away!

Update so that migrate=true implies sticky=true as follows:

  • The failure mode when client ACLs are enabled comes from the server not passing along a migration token. Update the server so that the server provides a migration token whenever migrate=true and not just when sticky=true too.
  • Update the scheduler so that migrate implies sticky.
  • Update the client so that we check for migrate || sticky where appropriate.
  • Refactor the E2E tests to move them off the old framework and make the intention of the test more clear.

Fixes #12314

E2E test results:

$ go test -v -count=1 ./nodedrain
=== RUN   TestNodeDrain
=== RUN   TestNodeDrain/IgnoreSystem
    node_drain_test.go:239: alloc has drained from node=b39d3578 after 17.453100094s
=== RUN   TestNodeDrain/EphemeralMigrate
    node_drain_test.go:239: alloc has drained from node=b39d3578 after 16.54164811s
=== RUN   TestNodeDrain/KeepIneligible
--- PASS: TestNodeDrain (50.41s)
    --- PASS: TestNodeDrain/IgnoreSystem (25.99s)
    --- PASS: TestNodeDrain/EphemeralMigrate (22.94s)
    --- PASS: TestNodeDrain/KeepIneligible (1.26s)
PASS
ok      github.com/hashicorp/nomad/e2e/nodedrain        50.413s

The `ephemeral_disk` block's `migrate` field allows for best-effort migration of
the ephemeral disk data to new nodes. The documentation says the `migrate` field
is only respected if `sticky=true`, but in fact if client ACLs are not set the
data is migrated even if `sticky=false`.

The existing behavior when client ACLs are disabled has existed since the early
implementation, so "fixing" that case now would silently break backwards
compatibility. Additionally, having `migrate` not imply `sticky` seems
nonsensical: it suggests that if we place on a new node we migrate the data but
if we place on the same node, we throw the data away!

Update so that `migrate=true` implies `sticky=true` as follows:

* The failure mode when client ACLs are enabled comes from the server not passing
  along a migration token. Update the server so that the server provides a
  migration token whenever `migrate=true` and not just when `sticky=true` too.
* Update the scheduler so that `migrate` implies `sticky`.
* Update the client so that we check for `migrate || sticky` where appropriate.
* Refactor the E2E tests to move them off the old framework and make the intention
  of the test more clear.
@tgross tgross force-pushed the drain-migrate-no-sticky branch from d820318 to 5e97aea Compare April 7, 2023 14:50
@tgross tgross changed the title drain: migrate should imply sticky ephemeral disk: migrate should imply sticky Apr 7, 2023
@tgross tgross marked this pull request as ready for review April 7, 2023 15:16
@tgross tgross requested review from schmichael and shoenig April 7, 2023 15:16
Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

nomad/structs/structs.go Outdated Show resolved Hide resolved
scheduler/generic_sched.go Outdated Show resolved Hide resolved
@tgross tgross merged commit b8a472d into main Apr 7, 2023
@tgross tgross deleted the drain-migrate-no-sticky branch April 7, 2023 20:33
tgross added a commit that referenced this pull request Apr 12, 2023
The `ephemeral_disk` block's `migrate` field allows for best-effort migration of
the ephemeral disk data to new nodes. The documentation says the `migrate` field
is only respected if `sticky=true`, but in fact if client ACLs are not set the
data is migrated even if `sticky=false`.

The existing behavior when client ACLs are disabled has existed since the early
implementation, so "fixing" that case now would silently break backwards
compatibility. Additionally, having `migrate` not imply `sticky` seems
nonsensical: it suggests that if we place on a new node we migrate the data but
if we place on the same node, we throw the data away!

Update so that `migrate=true` implies `sticky=true` as follows:

* The failure mode when client ACLs are enabled comes from the server not passing
  along a migration token. Update the server so that the server provides a
  migration token whenever `migrate=true` and not just when `sticky=true` too.
* Update the scheduler so that `migrate` implies `sticky`.
* Update the client so that we check for `migrate || sticky` where appropriate.
* Refactor the E2E tests to move them off the old framework and make the intention
  of the test more clear.
tgross added a commit that referenced this pull request Apr 12, 2023
tgross added a commit that referenced this pull request Apr 12, 2023
tgross added a commit that referenced this pull request Apr 12, 2023
tgross added a commit that referenced this pull request Apr 12, 2023
tgross added a commit that referenced this pull request Apr 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

E2E: TestNodeDrainEphermeralMigrate should work with client ACLs
2 participants