It's been reported privately that jobs in a non-default namespace can cause node drains to hang. The behavior doesn't appear to be reproducible when the jobs are in the default namespace. It has been reported against Nomad 1.0.2 and reproduced on Nomad 1.0.4.
Similar symptoms to #7432, but in that case the job was in the default namespace.
This is easily reproducible on a cluster with two clients. Create 3 namespaces:
$ nomad namespace apply ns1
$ nomad namespace apply ns2
$ nomad namespace apply ns3
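Prepare a minimal jobspec and create three copies of it, one per namespace. The original attachment isn't reproduced here; a sketch consistent with the output below (job "example", group "web"; the driver and image are assumptions, and the namespace could equally be set with the -namespace CLI flag) might look like:

```hcl
# example1.nomad -- example2.nomad and example3.nomad differ only in
# the namespace (ns2, ns3). Driver and image are illustrative.
job "example" {
  namespace   = "ns1"
  datacenters = ["dc1"]

  group "web" {
    task "web" {
      driver = "docker"

      config {
        image = "nginx:alpine"
      }
    }
  }
}
```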
Check the client status, then drain the first node before running any jobs (to ensure they all end up on the same client):
$ nomad node status
ID DC Name Class Drain Eligibility Status
6cb8547c dc1 ip-172-31-7-19 <none> false eligible ready
bb027927 dc1 ip-172-31-15-185 <none> false eligible ready
# drain the first node
$ nomad node drain -enable -yes 6cb8547c
2021-03-12T14:33:45-05:00: Ctrl-C to stop monitoring: will not cancel the node drain
2021-03-12T14:33:45-05:00: Node "6cb8547c-8de2-ec90-453f-63f6b6abd572" drain strategy set
2021-03-12T14:33:45-05:00: Drain complete for node 6cb8547c-8de2-ec90-453f-63f6b6abd572
2021-03-12T14:33:45-05:00: All allocations on node "6cb8547c-8de2-ec90-453f-63f6b6abd572" have stopped
Run all three jobs:
$ nomad job run ./example1.nomad
==> Monitoring evaluation "308e5823"
Evaluation triggered by job "example"
==> Monitoring evaluation "308e5823"
Evaluation within deployment: "2eb1d3bd"
Allocation "a5bb2d98" created: node "bb027927", group "web"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "308e5823" finished with status "complete"
$ nomad job run ./example2.nomad
==> Monitoring evaluation "7d69aa4d"
Evaluation triggered by job "example"
==> Monitoring evaluation "7d69aa4d"
Evaluation within deployment: "589639ce"
Allocation "294844e2" created: node "bb027927", group "web"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "7d69aa4d" finished with status "complete"
$ nomad job run ./example3.nomad
==> Monitoring evaluation "bdb9f1f1"
Evaluation triggered by job "example"
==> Monitoring evaluation "bdb9f1f1"
Evaluation within deployment: "c7437065"
Allocation "236225aa" created: node "bb027927", group "web"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "bdb9f1f1" finished with status "complete"
# verify they're running
$ nomad job status -namespace ns1
ID Type Priority Status Submit Date
example service 50 running 2021-03-12T14:34:13-05:00
$ nomad job status -namespace ns2
ID Type Priority Status Submit Date
example service 50 running 2021-03-12T14:34:16-05:00
$ nomad job status -namespace ns3
ID Type Priority Status Submit Date
example service 50 running 2021-03-12T14:34:19-05:00
Unset the node drain to give us a place to migrate the allocs:
$ nomad node drain -disable -yes 6cb8547c
Drain the node that has the allocs. It will reliably hang here:
$ nomad node drain -enable -yes -deadline 5m bb027927
2021-03-12T14:35:48-05:00: Ctrl-C to stop monitoring: will not cancel the node drain
2021-03-12T14:35:48-05:00: Node "bb027927-354b-cab2-776d-692cbc24d131" drain strategy set
2021-03-12T14:35:49-05:00: Alloc "236225aa-bea5-26ca-aa12-80a4d6985c60" marked for migration
2021-03-12T14:35:49-05:00: Alloc "294844e2-26b2-6926-0781-64b88a897d67" marked for migration
2021-03-12T14:35:49-05:00: Alloc "a5bb2d98-4d66-8741-6c27-5f1732cf2f38" marked for migration
2021-03-12T14:35:49-05:00: Alloc "a5bb2d98-4d66-8741-6c27-5f1732cf2f38" draining
2021-03-12T14:35:54-05:00: Alloc "a5bb2d98-4d66-8741-6c27-5f1732cf2f38" status running -> complete
If we check the drain status, we see one job has drained, but not the others:
$ nomad job status -namespace ns1 example
...
Allocations
ID Node ID Task Group Version Desired Status Created Modified
6d18ef66 6cb8547c web 0 run running 37s ago 18s ago
a5bb2d98 bb027927 web 0 stop complete 2m13s ago 31s ago
$ nomad job status -namespace ns2 example
...
Allocations
ID Node ID Task Group Version Desired Status Created Modified
294844e2 bb027927 web 0 run running 2m43s ago 2m33s ago
$ nomad job status -namespace ns3 example
...
Allocations
ID Node ID Task Group Version Desired Status Created Modified
236225aa bb027927 web 0 run running 3m1s ago 2m50s ago
After the 5 minute deadline passes, one more of the allocations drains and the node reports that the drain is now complete. But the last allocation is still running on the node:
...
2021-03-12T14:40:49-05:00: Drain complete for node bb027927-354b-cab2-776d-692cbc24d131
2021-03-12T14:40:49-05:00: Alloc "294844e2-26b2-6926-0781-64b88a897d67" draining
2021-03-12T14:40:54-05:00: Alloc "294844e2-26b2-6926-0781-64b88a897d67" status running -> complete
I've pulled both the Nomad server logs and the Nomad client logs and found nothing of particular note, so this isn't throwing unexpected errors. I've grabbed a debug bundle and uploaded it here (this cluster was for testing only and has been destroyed, so nothing private is exposed).
This fixes a bug affecting node drains, where allocs could fail to be
migrated if they belonged to different namespaces but shared the same
job name.
The root cause is that the helper function that creates the migration
evals indexed the allocs by job ID without accounting for their
namespaces. When job IDs clash across namespaces, an eval is created
for only one of them and the remaining allocs are left running on the
node.
Fixes #10172
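The faulty indexing can be illustrated with a minimal Go sketch. The types and function names below are illustrative stand-ins, not Nomad's actual drainer code; the point is only the choice of map key:

```go
package main

import "fmt"

// Alloc is a stand-in for Nomad's allocation struct (illustrative only).
type Alloc struct {
	ID, Namespace, JobID string
}

// groupBuggy mimics the pre-fix behavior: allocs are indexed by job ID
// alone, so same-named jobs in different namespaces clobber one key.
func groupBuggy(allocs []Alloc) map[string][]Alloc {
	m := map[string][]Alloc{}
	for _, a := range allocs {
		m[a.JobID] = append(m[a.JobID], a)
	}
	return m
}

// groupFixed mimics the fix: the namespace is part of the key, so each
// (namespace, job ID) pair gets its own entry.
func groupFixed(allocs []Alloc) map[[2]string][]Alloc {
	m := map[[2]string][]Alloc{}
	for _, a := range allocs {
		k := [2]string{a.Namespace, a.JobID}
		m[k] = append(m[k], a)
	}
	return m
}

func main() {
	// The three allocs from the repro: same job name, three namespaces.
	allocs := []Alloc{
		{"a5bb2d98", "ns1", "example"},
		{"294844e2", "ns2", "example"},
		{"236225aa", "ns3", "example"},
	}
	// One migration eval would be created per map key, so the buggy
	// grouping yields a single eval for three distinct jobs.
	fmt.Println(len(groupBuggy(allocs)), len(groupFixed(allocs))) // 1 3
}
```

With one key per (namespace, job ID) pair, each job gets its own migration eval and all three allocs drain.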
Debug bundle: nomad-debug-2021-03-12-193800Z.tar.gz