No automatic manager shutdown on demotion/removal #1829

aaronlehmann · 2016-12-22T01:40:38Z

This is an attempt to fix the demotion process. Right now, the agent finds out about a role change, and independently, the raft node finds out it's no longer in the cluster, and shuts itself down. This causes the manager to also shut itself down. This is very error-prone and has led to a lot of problems. I believe there are corner cases that are not properly addressed.

This changes things so that raft only signals to the higher level that the node has been removed. The manager supervision code will shut down the manager, and wait a certain amount of time for a role change (which should come through the agent reconnecting to a different manager now that the local manager is shut down).

In addition, I had to fix a longstanding problem with demotion. Demoting a node involves updating both the node object itself and the raft member list. This can't be done atomically. If the leader shuts itself down when the object is updated but before the raft member list is updated, we end up in an inconsistent state. I fixed this by adding code that reconciles the raft member list against the node objects.

cc @LK4D4

Signed-off-by: Aaron Lehmann [email protected]

aaronlehmann · 2017-01-04T01:43:58Z

I think I understand why this wasn't working. controlapi's UpdateNode handler demotes a node in two steps:

Update the node object in the store (store.UpdateNode)
Remove the raft member (raft.RemoveMember)

There is a comment in the code explaining that this is not done atomically and that can be a problem:

// TODO(abronan): the remove can potentially fail and leave the node with
// an incorrect role (worker rather than manager), we need to reconcile the
// memberlist with the desired state rather than attempting to remove the
// member once.

Basically, this change triggers the problem very often. The demoted manager will shut down as soon as it sees its role change to "worker". This is likely to be in between store.UpdateNode and raft.RemoveMember. This can cause quorum to be lost, because the quorum doesn't change to reflect the demotion until RemoveMember succeeds.

Reordering the calls would fix the problem in some cases, but create other problems. It wouldn't work when demoting the leader, because leader would remove itself from raft before it has a chance to update the node object.

The most correct way I can think of to fix this is to reconcile the member list as the TODO suggests. node.Spec.Role would control whether the node should be part of the raft member list or not. The raft member list would be reconciled against this. Then we could have node.Role, an observed property derived from the raft member list. Certificate renewals would be triggered based on changes to the observed role, not the desired role (and the CA would issue certificates based on observed role). By the time the observed role changes, the member list has been updated, and it's safe to do things like shut down the manager.

My only concern about this approach is that "observed role" (node.Role) could be confusing. In the paragraph above, observed role is based on whether the raft member list has been reconciled. It doesn't show whether the node has renewed its certificate to one that reflects the updated role. But I think this is still worth trying. If necessary, we can call the field something more specific like node.ReconciledRole.

codecov-io · 2017-01-04T19:35:49Z

Current coverage is 54.60% (diff: 44.06%)

Merging #1829 into master will decrease coverage by 0.15%

@@             master      #1829   diff @@
==========================================
  Files           102        103     +1   
  Lines         17050      17150   +100   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           9336       9364    +28   
- Misses         6580       6648    +68   
- Partials       1134       1138     +4

Powered by Codecov. Last update 42baa0d...be580ec

aaronlehmann · 2017-01-04T19:35:59Z

I've added a commit that implements the role reconcilation I described above. Now Node has a Role field outside of spec as well, that reflects the actual role that the CA will respect. On demotion, this field isn't updated until the node is removed from the member list.

CI is passing now. PTAL.

I can split this PR into two if people want me to.

aaronlehmann · 2017-01-05T00:36:47Z

I made two more fixes that seem to help with lingering occasional integration test issues.

First, I had to remove this code from node.go which sets the role to WorkerRole before obtaining a worker certificate:

                               // switch role to agent immediately to shutdown manager early
                               if role == ca.WorkerRole {
                                       n.role = role
                                       n.roleCond.Broadcast()
                               }

In rare cases, this could cause problems with rapid promotion/demotion cycles. If a certificate renewal got kicked off before the demotion (say, by an earlier promotion), the role could flip from worker back to manager, until another renewal happens. This would cause the manager to start and join as a new member, which caused problems in the integration tests.

I also added a commit that adds a Cancel method to raft.Node. This interrupts current and future proposals, which deals with possible deadlocks on shutdown. We shut down services such as the dispatcher, CA, and so on before shutting down raft, because those other service depend on raft. But if one of those services is trying to write to raft and the quorum is not met, its Stop method might block waiting for the write to complete. The Cancel method gives us a way to prevent writes to raft from blocking things during the manager shutdown process. This probably was an issue before this PR, although somehow the change to node.go discussed above makes it much easier to trigger.

This PR is split into 3 commits, and I can open separate PRs for them if necessary.

This seems very solid now. I ran the integration tests 32 times without any failures.

PTAL

cyli · 2017-01-05T19:53:30Z

api/objects.proto

+	// updated after the Raft member list has been reconciled with the
+	// desired role from the spec. Note that this doesn't show whether the
+	// node has obtained a certificate that reflects its current role.
+	NodeRole role = 9;


I agree that your suggestion of reconciled_role would be more clear here, since you're right that "observed" implies (to me at least) the role of the certificate the node is currently using.

My hesitation about this is that the worker is going to act on this field to trigger certificate renewals. "Reconciled role" has meaning to the manager, but not the worker. From the worker's perspective, "Get a new certificate if your role changes" makes sense, but "Get a new certificate if your reconciled role changes" doesn't make as much sense.

What about changing the field under Spec to DesiredRole and leaving this one as Role? I didn't think of that before, but as long as the field number doesn't change, it's fine to rename protobuf fields.

That'd make sense too. :) Thanks

Renamed the Spec field to DesiredRole, thanks.

cyli · 2017-01-05T19:59:09Z

manager/manager.go

@@ -394,33 +401,22 @@ func (m *Manager) Run(parent context.Context) error {
 		if err != nil {
 			errCh <- err


Since errCh doesn't seem to be used any more, should it be removed?

Since errCh doesn't seem to be used any more, should it be removed?

Fixed, thanks

aaronlehmann · 2017-01-05T20:38:55Z

Race revealed by CI fixed in #1846

LK4D4 · 2017-01-05T21:38:09Z

api/specs.proto

-	// Role defines the role the node should have.
-	NodeRole role = 2;
+	// DesiredRole defines the role the node should have.
+	NodeRole desired_role = 2;


isn't this breaking api change?

Not in gRPC. If we rename the field in the JSON API, it would be. Not sure what we should do about that. We could either keep the current name in JSON, or back out this part of the change and keep Role for gRPC as well. Not sure what's best.

No, field number and type stay the same, so this won't break anything.

stevvooe · 2017-01-05T21:41:40Z

api/objects.proto

+	// Role is the *observed* role for this node. It differs from the
+	// desired role set in Node.Spec.Role because the role here is only
+	// updated after the Raft member list has been reconciled with the
+	// desired role from the spec. Note that this doesn't show whether the


Maybe add that if the role is being used to tell whether or not an action may be performed, the certificate should be verified. This field is mostly informational.

Maybe add that if the role is being used to tell whether or not an action may be performed, the certificate should be verified. This field is mostly informational.

Thanks, added this.

stevvooe · 2017-01-05T21:53:24Z

LGTM

aaronlehmann · 2017-01-05T23:54:58Z

This passed Docker integration tests.

cyli · 2017-01-06T00:14:03Z

manager/role_manager.go

+		for _, node := range nodes {
+			rm.reconcileRole(node)
+		}
+		if len(rm.pending) != 0 {


At this point, is rm.pending always empty? So ticker will never be set?
I'm also wondering if the nodes in nodes should be added to pending? If any of them fail to be reconciled, they will not be reconciled again until that particular node is updated, or until some other updated node fails to be reconciled.

Oops, the nodes should indeed be added to pending. Fixed.

cyli · 2017-01-06T00:15:37Z

manager/role_manager.go

+				tickerCh = ticker.C
+			}
+		case <-tickerCh:
+			for _, node := range nodes {


Should this be iterating over rm.pending? nodes doesn't ever get updated, so will this be iterating over the list of nodes in existence when the watch first started?

You are correct. I've fixed this. Thanks for finding these problems.

cyli · 2017-01-06T01:02:01Z

Since promoting/demoting nodes is now eventually consistent, does any logic in the control API need to change with regards to the quorum safeguard? In a 3-node cluster, if you demote 2 nodes in rapid succession, the demotions may both succeed because the raft membership has not changed yet.

Not sure if this is a valid test, but it fails (I put it in manager/controlapi/node_test.go) because the second demote succeeds.

func TestUpdateNodeDemoteTwice(t *testing.T) {
	t.Parallel()
	tc := cautils.NewTestCA(nil)
	defer tc.Stop()
	ts := newTestServer(t)
	defer ts.Stop()

	nodes, _ := raftutils.NewRaftCluster(t, tc)
	defer raftutils.TeardownCluster(t, nodes)

	// Assign one of the raft node to the test server
	ts.Server.raft = nodes[1].Node
	ts.Server.store = nodes[1].MemoryStore()

	// Create a node object for each of the managers
	assert.NoError(t, nodes[1].MemoryStore().Update(func(tx store.Tx) error {
		assert.NoError(t, store.CreateNode(tx, &api.Node{
			ID: nodes[1].SecurityConfig.ClientTLSCreds.NodeID(),
			Spec: api.NodeSpec{
				DesiredRole: api.NodeRoleManager,
				Membership:  api.NodeMembershipAccepted,
			},
			Role: api.NodeRoleManager,
		}))
		assert.NoError(t, store.CreateNode(tx, &api.Node{
			ID: nodes[2].SecurityConfig.ClientTLSCreds.NodeID(),
			Spec: api.NodeSpec{
				DesiredRole: api.NodeRoleManager,
				Membership:  api.NodeMembershipAccepted,
			},
			Role: api.NodeRoleManager,
		}))
		assert.NoError(t, store.CreateNode(tx, &api.Node{
			ID: nodes[3].SecurityConfig.ClientTLSCreds.NodeID(),
			Spec: api.NodeSpec{
				DesiredRole: api.NodeRoleManager,
				Membership:  api.NodeMembershipAccepted,
			},
			Role: api.NodeRoleManager,
		}))
		return nil
	}))

	// Try to demote Nodes 2 and 3 in quick succession, this should fail because of the quorum safeguard
	r2, err := ts.Client.GetNode(context.Background(), &api.GetNodeRequest{NodeID: nodes[2].SecurityConfig.ClientTLSCreds.NodeID()})
	assert.NoError(t, err)
	r3, err := ts.Client.GetNode(context.Background(), &api.GetNodeRequest{NodeID: nodes[3].SecurityConfig.ClientTLSCreds.NodeID()})
	assert.NoError(t, err)

	_, err = ts.Client.UpdateNode(context.Background(), &api.UpdateNodeRequest{
		NodeID: nodes[2].SecurityConfig.ClientTLSCreds.NodeID(),
		Spec: &api.NodeSpec{
			DesiredRole: api.NodeRoleWorker,
			Membership:  api.NodeMembershipAccepted,
		},
		NodeVersion: &r2.Node.Meta.Version,
	})
	assert.NoError(t, err)

	_, err = ts.Client.UpdateNode(context.Background(), &api.UpdateNodeRequest{
		NodeID: nodes[3].SecurityConfig.ClientTLSCreds.NodeID(),
		Spec: &api.NodeSpec{
			DesiredRole: api.NodeRoleWorker,
			Membership:  api.NodeMembershipAccepted,
		},
		NodeVersion: &r3.Node.Meta.Version,
	})
	assert.Error(t, err)
	assert.Equal(t, codes.FailedPrecondition, grpc.Code(err))
}

aaronlehmann · 2017-01-06T01:18:22Z

Thanks for bringing this up - it's an interesting issue and I want to make sure we handle that case right. I took a look, and I think the code in this PR is doing the right thing.

The quorum safeguard in controlapi only applies to changing the desired role, but actually completing the demotion is guarded by another quorum safeguard in roleManager. This code is not concurrent, so there shouldn't be any consistency issues.

I think what we're doing here is a big improvement from the old code, which did have this issue because calls to the controlapi could happen concurrently, and that was the only place the quorum check happened.

The reason your test is failing is that controlapi no longer actually removes the node from the raft member list. You need to have a roleManager running for that to happen. For the purpose of the test, it might be better to explicitly call RemoveMember, so you don't have to rely on a race to see possible issues.

aaronlehmann · 2017-01-06T01:19:40Z

Also, in theory once it's possible to finish the demotion without losing quorum, roleManager would finish it automatically. Which is pretty cool, I think.

aaronlehmann · 2017-01-06T01:21:41Z

But maybe you have a valid point that we shouldn't allow the DesiredRole change to go through if it's something that can't be satisfied immediately, just for the sake of usability. Maybe that's good followup material?

cyli · 2017-01-06T01:37:20Z

I think my numbers are also wrong in the test. It also fails in master, so it's just an invalid test I think. Maybe 5, 1 down, 2 demotions? :) Also, yes you're right, the role manager isn't running, so that'd also be a problem.

But yes, I was mainly wondering if there'd be problem if the control api says all is well, but the demotion can't actually happen (until something else comes back up). Apologies I wasn't clear - I wasn't trying to imply that this PR wasn't eventually doing the demotion.

I don't think I have any other feedback other than the UI thing. :) It LGTM.

cyli · 2017-01-06T01:41:28Z

Also regarding the delay in satisfying the demotion, I guess demotion takes a little time already, so I don't think this PR actually introduces any new issues w.r.t. immediate feedback, so 👍 for thinking about it later on

LK4D4 · 2017-01-07T01:12:23Z

@aaronlehmann LGTM, feel free to merge after rebase.

When a node is demoted, two things need to happen: the node object itself needs to be updated, and the raft member list needs to be updated to remove that node. Previously, it was possible to get into a bad state where the node had been updated but not removed from the member list. This changes the approach so that controlapi only updates the node object, and there is a goroutine that watches for node changes and updates the member list accordingly. This means that demotion will work correctly even if there is a node failure or leader change in between the two steps. Signed-off-by: Aaron Lehmann <[email protected]>

This is an attempt to fix the demotion process. Right now, the agent finds out about a role change, and independently, the raft node finds out it's no longer in the cluster, and shuts itself down. This causes the manager to also shut itself down. This is very error-prone and has led to a lot of problems. I believe there are corner cases that are not properly addressed. This changes things so that raft only signals to the higher level that the node has been removed. The manager supervision code will shut down the manager, and wait a certain amount of time for a role change (which should come through the agent reconnecting to a different manager now that the local manager is shut down). Signed-off-by: Aaron Lehmann <[email protected]>

Shutting down the manager can deadlock if a component is waiting for store updates to go through. This Cancel method allows current and future proposals to be interrupted to avoid those deadlocks. Once everything using raft has shut down, the Stop method can be used to complete raft shutdown. Signed-off-by: Aaron Lehmann <[email protected]>

Signed-off-by: Aaron Lehmann <[email protected]>

aaronlehmann added the status/1-design-review label Dec 22, 2016

aaronlehmann mentioned this pull request Dec 22, 2016

Allow managers not to expose a remote API port #1826

Merged

6 tasks

aaronlehmann force-pushed the correct-demotion branch from 3220c44 to fa998a9 Compare January 4, 2017 19:26

aaronlehmann changed the title ~~[WIP] No automatic manager shutdown on demotion/removal~~ No automatic manager shutdown on demotion/removal Jan 4, 2017

aaronlehmann force-pushed the correct-demotion branch from fa998a9 to a2c96b9 Compare January 4, 2017 22:05

aaronlehmann mentioned this pull request Jan 5, 2017

ca: Don't block server's Stop until handlers finish #1841

Merged

aaronlehmann force-pushed the correct-demotion branch from a2c96b9 to a1cba5b Compare January 5, 2017 00:30

cyli reviewed Jan 5, 2017

View reviewed changes

aaronlehmann force-pushed the correct-demotion branch from a1cba5b to 3af077c Compare January 5, 2017 20:05

LK4D4 reviewed Jan 5, 2017

View reviewed changes

stevvooe reviewed Jan 5, 2017

View reviewed changes

LK4D4 added status/2-code-review and removed status/1-design-review labels Jan 5, 2017

aaronlehmann force-pushed the correct-demotion branch 2 times, most recently from 167b493 to f333714 Compare January 5, 2017 22:02

cyli reviewed Jan 6, 2017

View reviewed changes

aaronlehmann force-pushed the correct-demotion branch 2 times, most recently from 7536335 to cee1cc0 Compare January 6, 2017 00:31

aaronlehmann force-pushed the correct-demotion branch 2 times, most recently from 6b60061 to ce45628 Compare January 6, 2017 22:29

aaronlehmann added 5 commits January 6, 2017 17:14

Rename Node.Spec.Role to Node.Spec.DesiredRole

59b1a0a

Signed-off-by: Aaron Lehmann <[email protected]>

Clarify comment on Node.Role

be580ec

Signed-off-by: Aaron Lehmann <[email protected]>

aaronlehmann force-pushed the correct-demotion branch from ce45628 to be580ec Compare January 7, 2017 01:14

aaronlehmann merged commit 0bc3921 into moby:master Jan 7, 2017

aaronlehmann deleted the correct-demotion branch January 7, 2017 01:23

cyli mentioned this pull request Mar 21, 2018

Node Promotion/Demotion workflow review #2565

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No automatic manager shutdown on demotion/removal #1829

No automatic manager shutdown on demotion/removal #1829

aaronlehmann commented Dec 22, 2016 •

edited

Loading

aaronlehmann commented Jan 4, 2017

codecov-io commented Jan 4, 2017 •

edited

Loading

aaronlehmann commented Jan 4, 2017

aaronlehmann commented Jan 5, 2017

cyli Jan 5, 2017

aaronlehmann Jan 5, 2017

cyli Jan 5, 2017

aaronlehmann Jan 5, 2017

cyli Jan 5, 2017

aaronlehmann Jan 5, 2017

aaronlehmann commented Jan 5, 2017

LK4D4 Jan 5, 2017

aaronlehmann Jan 5, 2017

stevvooe Jan 5, 2017

stevvooe Jan 5, 2017

aaronlehmann Jan 5, 2017

stevvooe commented Jan 5, 2017

aaronlehmann commented Jan 5, 2017

cyli Jan 6, 2017

aaronlehmann Jan 6, 2017

cyli Jan 6, 2017

aaronlehmann Jan 6, 2017

cyli commented Jan 6, 2017 •

edited

Loading

aaronlehmann commented Jan 6, 2017

aaronlehmann commented Jan 6, 2017

aaronlehmann commented Jan 6, 2017

cyli commented Jan 6, 2017 •

edited

Loading

cyli commented Jan 6, 2017

LK4D4 commented Jan 7, 2017

		@@ -394,33 +401,22 @@ func (m *Manager) Run(parent context.Context) error {
		if err != nil {
		errCh <- err

No automatic manager shutdown on demotion/removal #1829

No automatic manager shutdown on demotion/removal #1829

Conversation

aaronlehmann commented Dec 22, 2016 • edited Loading

aaronlehmann commented Jan 4, 2017

codecov-io commented Jan 4, 2017 • edited Loading

Current coverage is 54.60% (diff: 44.06%)

aaronlehmann commented Jan 4, 2017

aaronlehmann commented Jan 5, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aaronlehmann commented Jan 5, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stevvooe commented Jan 5, 2017

aaronlehmann commented Jan 5, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cyli commented Jan 6, 2017 • edited Loading

aaronlehmann commented Jan 6, 2017

aaronlehmann commented Jan 6, 2017

aaronlehmann commented Jan 6, 2017

cyli commented Jan 6, 2017 • edited Loading

cyli commented Jan 6, 2017

LK4D4 commented Jan 7, 2017

aaronlehmann commented Dec 22, 2016 •

edited

Loading

codecov-io commented Jan 4, 2017 •

edited

Loading

cyli commented Jan 6, 2017 •

edited

Loading

cyli commented Jan 6, 2017 •

edited

Loading