-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix panic draining when alloc on non-existent node #4215
Conversation
just compiled this and pushed to flapping production cluster. it became a leader and stopped flapping! |
@burdandrei Awesome, thanks for confirming the fix! I saw your comment over on 4207 as well, just wanted to let you know we'll be cutting another release soon (on the order of days) with this fix in it. |
@@ -125,6 +125,11 @@ func (n *drainingNode) DrainingJobs() ([]structs.NamespacedID, error) { | |||
n.l.RLock() | |||
defer n.l.RUnlock() | |||
|
|||
// Should never happen | |||
if n.node == nil || n.node.DrainStrategy == nil { | |||
return nil, fmt.Errorf("node doesn't have a drain strategy set") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is returning an error much better than panicking because won't it cause leader flapping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be safe how we use it: https://github.com/hashicorp/nomad/blob/master/nomad/drainer/watch_nodes.go#L51-L71
Update will only be called if the node is draining. So it is more going to help future uses not introduce bad logic.
great! I encountered it yesterday in our production environment :( |
Close one, I encountered this when testing draining in our staging environment. Thanks for the fix. I will be upgrading our prod cluster from 0.8.1 to 0.8.3 ASAP. |
I have upgraded to 0.8.4 and cluster was running fine for few days .. We experienced some issues and since then I am not able to rebuild cluster.. Aug 31 12:38:22 ip-10-40-240-80 nomad[16228]: 2018/08/31 12:38:22 [DEBUG] serf: messageJoinType: ip-10-40-240-80.eu-west-1 ========================= nomad server members Error determining leaders: 1 error(s) occurred:
===================================== |
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
This PR fixes a panic that occured when a job was being drained and the job has
an allocation that references a garbage collected node.
Added a unit test around the function that was panicking and an integration test that adds an allocation pointing to a non-existent node for all types of jobs and causes a drain. This tests that all parts of the code base handles the nil node (both tests panic without the fix and pass with the fix).
Fixes #4207