
Fix leaking task resources when nodes are deleted #2806

Merged
dperny merged 1 commit into moby:master on Mar 21, 2019

Conversation

@dperny (Collaborator) commented Jan 14, 2019

Signed-off-by: Drew Erny [email protected]

- What I did

When a node is deleted, its tasks are asked to restart, which involves putting them into a desired state of Shutdown. However, the Allocator will not deallocate a task whose actual state is not a terminal state. Once a node is deleted, the only opportunity for its tasks to receive updates and be moved to a terminal state is when the function moving those tasks to TaskStateOrphaned is called, 24 hours after the node enters the Down state. If a leadership change occurs in that window, that function is never called, the tasks never reach a terminal state, and their resources leak.

With this change, upon node deletion, all of its tasks will be moved to TaskStateOrphaned, allowing those tasks' resources to be cleaned up.

- How I did it

Altered the controlapi to move all tasks belonging to the node to the actual state of Orphaned.
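
As a rough illustration of that change, here is a minimal sketch of what an orphanNodeTasks helper could look like, assuming swarmkit's store helpers (store.FindTasks, store.ByNodeID, store.UpdateTask) and the api.TaskStateOrphaned state; the body below is illustrative only, not the merged code:

```go
package controlapi

import (
	"github.com/docker/swarmkit/api"
	"github.com/docker/swarmkit/manager/state/store"
)

// orphanNodeTasks (sketch) moves every task assigned to the deleted node to
// the Orphaned actual state so the allocator can release its resources.
func orphanNodeTasks(tx store.Tx, nodeID string) error {
	// Find all tasks that were assigned to the deleted node.
	tasks, err := store.FindTasks(tx, store.ByNodeID(nodeID))
	if err != nil {
		return err
	}
	for _, task := range tasks {
		// Force the observed state to Orphaned, a terminal state, so the
		// allocator will deallocate the task's resources.
		task.Status = api.TaskStatus{
			State:   api.TaskStateOrphaned,
			Message: "node was deleted", // illustrative message
		}
		if err := store.UpdateTask(tx, task); err != nil {
			return err
		}
	}
	return nil
}
```

Presumably the control API invokes this inside the same store update transaction that removes the node, so the orphaning and the node deletion are applied together.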

- How to test it

Includes a unit test. However, testing for the full failure condition is nontrivial and likely not worth the effort.
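
Purely to illustrate the shape such a test can take (the merged change actually reworks the existing TestRemoveNodeAttachments test, as the review comments below show), here is a hypothetical standalone test against an in-memory store; the test name and seeded data are made up:

```go
package controlapi

import (
	"testing"

	"github.com/docker/swarmkit/api"
	"github.com/docker/swarmkit/manager/state/store"
	"github.com/stretchr/testify/require"
)

// TestOrphanNodeTasksSketch: create a node with one running task, orphan the
// node's tasks, and verify the task's actual state afterwards.
func TestOrphanNodeTasksSketch(t *testing.T) {
	s := store.NewMemoryStore(nil)
	defer s.Close()

	// Seed a node and a task assigned to it.
	require.NoError(t, s.Update(func(tx store.Tx) error {
		if err := store.CreateNode(tx, &api.Node{ID: "node1"}); err != nil {
			return err
		}
		return store.CreateTask(tx, &api.Task{
			ID:     "task1",
			NodeID: "node1",
			Status: api.TaskStatus{State: api.TaskStateRunning},
		})
	}))

	// Orphan the node's tasks, as the controlapi now does on node deletion.
	require.NoError(t, s.Update(func(tx store.Tx) error {
		return orphanNodeTasks(tx, "node1")
	}))

	// The task should now report the Orphaned actual state.
	s.View(func(tx store.ReadTx) {
		task := store.GetTask(tx, "task1")
		require.NotNil(t, task)
		require.Equal(t, api.TaskStateOrphaned, task.Status.State)
	})
}
```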

- Description for the changelog

Fix bug where deleting nodes could result in task resources leaking.

codecov bot commented Jan 14, 2019

Codecov Report

Merging #2806 into master will increase coverage by 0.01%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master    #2806      +/-   ##
==========================================
+ Coverage   61.93%   61.94%   +0.01%     
==========================================
  Files         137      137              
  Lines       22131    22132       +1     
==========================================
+ Hits        13706    13710       +4     
+ Misses       6944     6940       -4     
- Partials     1481     1482       +1

@anshulpundir (Contributor) left a comment

Are there any other implications of deleting tasks right away instead of waiting for 24hrs?

ts.Store.View(func(tx store.ReadTx) {
	tasks, err := store.FindTasks(tx, store.All)
	require.NoError(t, err)
	// should only be 3 tasks left
	require.Len(t, tasks, 4)
-	// and the list should not contain task1
+	// and the list should not contain task1 or task2

If task1 and task2 should have been removed, why are we testing for those tasks below?

@@ -1089,22 +1087,24 @@ func TestRemoveNodeAttachments(t *testing.T) {

	// Now, call the function with our nodeID. make sure it returns no error
	err = ts.Store.Update(func(tx store.Tx) error {
-		return removeNodeAttachments(tx, "id2")
+		return orphanNodeTasks(tx, "id2")

the orphaned tasks are deleted by the task reaper, correct? Can we also test for task deletions?

dperny merged commit 18e7e58 into moby:master on Mar 21, 2019
thaJeztah added a commit to thaJeztah/docker that referenced this pull request Oct 7, 2019
full diff: moby/swarmkit@7dded76...a8bbe7d

changes included:

- moby/swarmkit#2867 Only update non-terminal tasks on node removal
  - related to moby/swarmkit#2806 Fix leaking task resources when nodes are deleted
- moby/swarmkit#2880 Bump to golang 1.12.9
- moby/swarmkit#2886 Bump vendoring to match current docker/docker master
  - regenerates protobufs
- moby/swarmkit#2890 Remove hardcoded IPAM config subnet value for ingress network
  - fixes [ENGORC-2651] Specifying --default-addr-pool for docker swarm init is not picked up by ingress network

Signed-off-by: Sebastiaan van Stijn <[email protected]>
thaJeztah added a commit to thaJeztah/docker that referenced this pull request Oct 21, 2019 (same full diff and change list as the Oct 7, 2019 commit above)