Remove task serialization and use host resource manager for task resources #3723

Merged

prateekchaudhry merged 12 commits into aws:feature/task-resource-accounting from prateekchaudhry:taskAccounting

Jun 1, 2023

Contributor

prateekchaudhry commented May 30, 2023 •

edited

Summary

The ECS Agent ensures that host resources are available on the instance before running a task. Currently this is implemented through serialization: when scheduling a task, the agent waits for all previously stopping tasks (those whose StopSequenceNumber in the ACS payload is less than the sequence number of the requested task's payload) to stop.

This PR removes this serialization behavior and instead uses HostResourceManager to schedule tasks, with a FIFO task queue built into docker_task_engine to hold tasks that are waiting. The agent will therefore use the available CPU, memory, ports (TCP/UDP), and number of GPUs to manage tasks, and will start progressing tasks as soon as resources for them become available, instead of waiting for all stopping tasks to stop.

Implementation details

Removes the package sequential_waitgroup and the references to StartSequenceNumber and StopSequenceNumber, which are constructs related to task serialization.
Tasks get queued in a waitingTaskQueue and wait for host resources (managed through HostResourceManager) to become available. A goroutine, monitorQueuedTasks, dequeues and wakes up waiting tasks as resources become available. When it cannot dequeue any more tasks because resources are not available, it waits.
When a task stops or when a new task arrives, it wakes up monitorQueuedTasks in case it is blocked.
Management of host resources: when monitorQueuedTasks accounts for a task's resources, those resources are consumed. When a task changes its knownStatus to STOPPED and emits a state change, the resources are released.
For Agent restarts, reconcileHostResources runs during synchronizeState and synchronizes the HostResourceManager data structures with the known task states. If any container is known to have progressed beyond the ContainerStatusNone state, the task's host resources are consumed.
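
The queueing and accounting behavior described above can be illustrated with a minimal, self-contained sketch. This is not the agent's code: the types `resources`, `hostResources`, and `queuedTask` and the function `drainQueue` are hypothetical stand-ins, only CPU and memory are tracked, and the sketch assumes strict FIFO order, so a task at the head of the queue that does not fit blocks the tasks behind it.

```go
package main

import (
	"fmt"
	"sync"
)

// resources is a simplified stand-in for a task's host resource needs
// (the PR also accounts for ports and GPUs, which are omitted here).
type resources struct {
	CPU    int
	Memory int
}

// hostResources tracks what is still free on the host. It is a hypothetical
// type for illustration, not the agent's HostResourceManager.
type hostResources struct {
	mu       sync.Mutex
	free     resources
	consumed map[string]resources // keyed by task ARN
}

func newHostResources(total resources) *hostResources {
	return &hostResources{free: total, consumed: map[string]resources{}}
}

// consume reserves r for taskArn if it fits. Calling it again for a task
// that already holds resources is a no-op that reports success.
func (h *hostResources) consume(taskArn string, r resources) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	if _, ok := h.consumed[taskArn]; ok {
		return true
	}
	if r.CPU > h.free.CPU || r.Memory > h.free.Memory {
		return false
	}
	h.free.CPU -= r.CPU
	h.free.Memory -= r.Memory
	h.consumed[taskArn] = r
	return true
}

// release returns a task's resources to the pool; releasing twice is a no-op.
func (h *hostResources) release(taskArn string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	r, ok := h.consumed[taskArn]
	if !ok {
		return
	}
	h.free.CPU += r.CPU
	h.free.Memory += r.Memory
	delete(h.consumed, taskArn)
}

type queuedTask struct {
	Arn  string
	Need resources
}

// drainQueue starts tasks from the head of the FIFO queue until the head no
// longer fits, and returns the started ARNs plus the remaining queue.
func drainQueue(h *hostResources, queue []queuedTask) (started []string, rest []queuedTask) {
	for len(queue) > 0 && h.consume(queue[0].Arn, queue[0].Need) {
		started = append(started, queue[0].Arn)
		queue = queue[1:]
	}
	return started, queue
}

func main() {
	h := newHostResources(resources{CPU: 2048, Memory: 2048})
	queue := []queuedTask{
		{"task-a", resources{1024, 1024}},
		{"task-b", resources{1024, 1024}},
		{"task-c", resources{1024, 1024}},
	}

	started, queue := drainQueue(h, queue)
	fmt.Println(started, len(queue)) // [task-a task-b] 1

	// task-a stops: its resources are released and task-c can now start.
	h.release("task-a")
	started, queue = drainQueue(h, queue)
	fmt.Println(started, len(queue)) // [task-c] 0
}
```

In this sketch a waiting task starts as soon as one earlier task frees enough resources, rather than after every stopping task has stopped, which is the behavioral change the summary describes.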

Related PRs

Related Containers Roadmap Issue

aws/containers-roadmap#325

Testing

Manually tested reconciliation behavior with agent restarts and verified resources are allocated correctly from agent logs

level=debug time=2023-06-01T00:55:50Z msg="Task host resources to account for" MEMORY=1024 PORTS_TCP=[] PORTS_UDP=[] GPU=0 taskArn="arn:aws:ecs:us-west-2:<>:task/taskAccounting/..." CPU=1024

New tests cover the changes: Yes
TestTaskWaitForHostResources unit test to test task queueing/dequeuing

Description for the changelog

Remove task serialization and use host resource manager for task resources

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

prateekchaudhry added 7 commits

May 25, 2023 10:18


          remove seq wait group which implements serialization (broken)

9398f3c


          use host resource manager and queuing for task scheduing

58e1cfe


          refactor and fix docker_engine unit test

1c9224e


          fix task_manager unit test

98b8ff6


          Add unit test TestTaskWaitForHostResources

46b0a91


          update comments

62f4dce


          fix unit tests

7e6f504

prateekchaudhry requested a review from a team as a code owner

May 30, 2023 02:55

prateekchaudhry added the bot/test label

amazon-ecs-bot removed the bot/test label

prateekchaudhry added 2 commits

May 29, 2023 22:42


          fix tcp ports and internal tasks

3ed1ec2


          fix empty subslice removal

c7e3b86

prateekchaudhry force-pushed the taskAccounting branch from 098ee63 to c7e3b86 Compare

May 30, 2023 06:24

prateekchaudhry added the bot/test label

amazon-ecs-bot removed the bot/test label

prateekchaudhry added the bot/test label

amazon-ecs-bot removed the bot/test label

prateekchaudhry force-pushed the taskAccounting branch from 949e804 to eff922e Compare

May 30, 2023 08:43

prateekchaudhry added the bot/test label

amazon-ecs-bot removed the bot/test label


          change integ test task cpu val

37f1f72

prateekchaudhry force-pushed the taskAccounting branch from eff922e to 37f1f72 Compare

May 30, 2023 16:12

prateekchaudhry changed the title ~~[WIP] Remove task serialization and use host resource manager for scheduling~~ Remove task serialization and use host resource manager for scheduling

prateekchaudhry added the bot/test label

amazon-ecs-bot removed the bot/test label

prateekchaudhry changed the title ~~Remove task serialization and use host resource manager for scheduling~~ Remove task serialization and use host resource manager for task resources

Yiyuanzzz previously approved these changes

fierlion reviewed

agent/engine/docker_task_engine.go

+              		// Call to release here for stopped tasks should always succeed
+              		// Idempotent release call
+              		if taskStatus.Terminal() {
+              			err := engine.hostResourceManager.release(task.Arn, resources)

Member

fierlion May 30, 2023 •

edited

When will this be useful? Will it only ever be a no-op if the agent is always starting from zero?
It would help if you could offer an example of where this might be useful in the future.

Contributor Author

prateekchaudhry May 31, 2023

Not particularly useful right now, but this is a generalized implementation for keeping HostResourceManager in sync with the engine. It might find more uses in the future, such as a periodic sync between the engine and the resource manager.

yinyic reviewed

agent/engine/docker_task_engine.go Outdated

+              		// Consume host resources if task has progressed
+              		// Call to consume here should always succeed
+              		// Idempotent consume call
+              		if !task.IsInternal && taskStatus > apitaskstatus.TaskCreated {

Contributor

yinyic May 30, 2023

We want to check for taskStatus == apitaskstatus.TaskCreated | TaskRunning

agent/engine/docker_task_engine.go Outdated

Comment on lines 332 to 338

+              		waitingTaskQueueSingleLen := false
+              		engine.waitingTasksLock.Lock()
+              		waitingTaskQueueSingleLen = len(engine.waitingTaskQueue) == 1
+              		engine.waitingTasksLock.Unlock()
+              		if waitingTaskQueueSingleLen {
+              			engine.monitorQueuedTaskEvent <- struct{}{}
+              		}

Contributor

yinyic May 31, 2023

Not quite following this logic - I think it's sufficient to wake up the queue when we enqueue and when a task stops

Contributor Author

prateekchaudhry May 31, 2023

Ack - making the channel buffered, with a no-op/empty default case
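
A buffered channel combined with an empty `default` case is a common Go idiom for a wake-up signal that never blocks the sender. The sketch below shows that idiom in isolation; the helper name `wake` and the channel name `events` are hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// wake signals the monitor without ever blocking the caller. It reports
// whether the signal was actually placed on the channel. If the buffer is
// already full, a wake-up is pending and this one is dropped.
func wake(ch chan<- struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	events := make(chan struct{}, 1)

	fmt.Println(wake(events)) // true: the buffer was empty
	fmt.Println(wake(events)) // false: a wake-up is already pending

	<-events                  // the monitor picks up the pending wake-up
	fmt.Println(wake(events)) // true: the buffer is empty again
}
```

With this pattern, callers can signal on every enqueue and every task stop without first checking the queue length, since a dropped signal only happens when another one is still waiting to be received.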

agent/engine/docker_task_engine.go Outdated

+              					break
+              				}
+              			}
+              			logger.Debug("No more tasks in Waiting Task Queue, waiting for new tasks")

Contributor

yinyic May 31, 2023

Nit - no more tasks could be started at this moment

agent/engine/docker_task_engine.go Outdated

Comment on lines 394 to 395

		consumable, err := engine.hostResourceManager.consumableSafe(taskHostResources)
		if err != nil {

Contributor

yinyic May 31, 2023

nit- redundant check

fierlion previously approved these changes

Member

fierlion left a comment

I'd like to see another test or three covering the engine as an opaque box. You can add this as a follow up.

prateekchaudhry dismissed stale reviews from fierlion and Yiyuanzzz via

539ae10

May 31, 2023 20:12

prateekchaudhry added the bot/test label

amazon-ecs-bot removed the bot/test label

Yiyuanzzz reviewed

agent/engine/docker_task_engine.go Outdated

+              		resourcesToRelease := task.ToHostResources()
+              		err := engine.hostResourceManager.release(task.Arn, resourcesToRelease)
+              		if err != nil {
+              			logger.Critical("Failed to release resources after tast stopped", logger.Fields{field.TaskARN: task.Arn})

Contributor

Yiyuanzzz May 31, 2023

non-blocking: small typo here "tast"


          simplify wakeup queue monitor

a7c2340

prateekchaudhry force-pushed the taskAccounting branch from 880f7ba to a7c2340 Compare

May 31, 2023 23:58


          use container status for reconciliation

b29e494

prateekchaudhry added the bot/test label

amazon-ecs-bot removed the bot/test label

Yiyuanzzz approved these changes

yinyic reviewed

agent/engine/docker_task_engine.go

Comment on lines +326 to +327

		// Always wakes up when at least one event arrives on buffered channel monitorQueuedTaskEvent
		// but does not block if monitorQueuedTasks is already processing queued tasks

Contributor

yinyic Jun 1, 2023

Can we elaborate a little more in the comment on 1) when we will be invoking this method (who will be sending messages onto the channel), and 2) why a buffer size of one is sufficient (why it's okay to drop any additional messages)?

agent/engine/docker_task_engine.go

Comment on lines +597 to +598

		// Before starting managedTask goroutines, pre-allocate resources for already running
		// tasks in host resource manager

Contributor

yinyic Jun 1, 2023

Not exactly "already running", more like tasks that have progressed beyond the resource consumption check

Contributor Author

prateekchaudhry Jun 1, 2023

Yes, will update these in follow up PR
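
The reconciliation rule from the PR description (consume resources for a task once any of its containers is known to have progressed beyond ContainerStatusNone, and release for tasks in a terminal state) can be sketched as a small predicate. The types `containerStatus` and `task` and the function `shouldHoldResources` are simplified, hypothetical stand-ins and do not mirror the agent's actual status enums or task model.

```go
package main

import "fmt"

// containerStatus is a simplified, hypothetical ordering of container states.
type containerStatus int

const (
	statusNone containerStatus = iota
	statusPulled
	statusCreated
	statusRunning
	statusStopped
)

// task is a minimal stand-in for the state the agent restores on restart.
type task struct {
	Arn        string
	Stopped    bool // true when the task's known status is terminal
	Containers []containerStatus
}

// shouldHoldResources reports whether, after an agent restart, the task
// should be treated as already holding host resources: it is not stopped
// and at least one container has progressed beyond statusNone.
func shouldHoldResources(t task) bool {
	if t.Stopped {
		return false
	}
	for _, s := range t.Containers {
		if s > statusNone {
			return true
		}
	}
	return false
}

func main() {
	tasks := []task{
		{Arn: "waiting", Containers: []containerStatus{statusNone, statusNone}},
		{Arn: "starting", Containers: []containerStatus{statusPulled, statusNone}},
		{Arn: "stopped", Stopped: true, Containers: []containerStatus{statusStopped}},
	}
	for _, t := range tasks {
		fmt.Println(t.Arn, shouldHoldResources(t))
	}
}
```

This matches the reviewer's point that the pre-allocation covers tasks that have moved past the resource check, not only tasks that are already running.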

yinyic approved these changes

Contributor

yinyic left a comment

Both of my remarks are about code comments, so I'm going to approve to unblock. Please update the comments in the next PR (I'm assuming we'll have a follow-up with more/updated tests).

prateekchaudhry merged commit e00484f into aws:feature/task-resource-accounting

prateekchaudhry added a commit that referenced this pull request


          Remove task serialization and use host resource manager for task reso…

48887fc

…urces (#3723)

This was referenced Jun 9, 2023

Add integ tests for task accounting #3741

Merged

Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order #3747

Merged

prateekchaudhry added a commit that referenced this pull request


          Remove task serialization and use host resource manager for task reso…

bb7133e

…urces (#3723)

prateekchaudhry mentioned this pull request

Merge Feature/task-resource-accounting to dev #3757

Merged

prateekchaudhry added a commit that referenced this pull request


          Remove task serialization and use host resource manager for task reso…

0a4673a

…urces (#3723)

prateekchaudhry mentioned this pull request

Release 1.73.0 #3759

Merged

sparrc added a commit to sparrc/amazon-ecs-agent that referenced this pull request


          Revert "Remove task serialization and use host resource manager for t…

6181ba6

…ask resources (aws#3723)"

This reverts commit 0a4673a.

sparrc added a commit that referenced this pull request


          Revert "Remove task serialization and use host resource manager for t…

cb54139

…ask resources (#3723)"

This reverts commit 0a4673a.

prateekchaudhry added a commit that referenced this pull request


          Remove task serialization and use host resource manager for task reso…

6bc7b20

…urces (#3723)

prateekchaudhry added a commit to prateekchaudhry/amazon-ecs-agent that referenced this pull request


          Revert "Revert "Remove task serialization and use host resource manag…

14069d0

…er for task resources (aws#3723)""

This reverts commit cb54139.

prateekchaudhry added a commit that referenced this pull request


          Revert reverted changes for task resource accounting (#3796)

96a64ef

* Revert "Revert "host resource manager initialization""

This reverts commit dafb967.

* Revert "Revert "Add method to get host resources reserved for a task (#3706)""

This reverts commit 8d824db.

* Revert "Revert "Add host resource manager methods (#3700)""

This reverts commit bec1303.

* Revert "Revert "Remove task serialization and use host resource manager for task resources (#3723)""

This reverts commit cb54139.

* Revert "Revert "add integ tests for task accounting (#3741)""

This reverts commit 61ad010.

* Revert "Revert "Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order (#3747)""

This reverts commit 60a3f42.

* Revert "Revert "Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue (#3750)""

This reverts commit 8943792.

prateekchaudhry mentioned this pull request

Merge Feature/task-resource-accounting to dev #3819

Merged

Realmonia pushed a commit that referenced this pull request


          Merge Feature/task-resource-accounting to dev (#3819)

fa4da21

* Revert reverted changes for task resource accounting (#3796)

* Revert "Revert "host resource manager initialization""

This reverts commit dafb967.

* Revert "Revert "Add method to get host resources reserved for a task (#3706)""

This reverts commit 8d824db.

* Revert "Revert "Add host resource manager methods (#3700)""

This reverts commit bec1303.

* Revert "Revert "Remove task serialization and use host resource manager for task resources (#3723)""

This reverts commit cb54139.

* Revert "Revert "add integ tests for task accounting (#3741)""

This reverts commit 61ad010.

* Revert "Revert "Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order (#3747)""

This reverts commit 60a3f42.

* Revert "Revert "Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue (#3750)""

This reverts commit 8943792.

* fix memory resource accounting for multiple containers in single task (#3782)

* fix memory resource accounting for multiple containers

* change unit tests for multiple containers, add unit test for awsvpc

