-
Notifications
You must be signed in to change notification settings - Fork 613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove task serialization and use host resource manager for task resources #3723
Remove task serialization and use host resource manager for task resources #3723
Conversation
098ee63
to
c7e3b86
Compare
949e804
to
eff922e
Compare
eff922e
to
37f1f72
Compare
// Call to release here for stopped tasks should always succeed | ||
// Idempotent release call | ||
if taskStatus.Terminal() { | ||
err := engine.hostResourceManager.release(task.Arn, resources) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when will this be useful? Will it only ever be a no-op if the agent is always starting from zero?
It would help if you could offer an example of where this might be useful in future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not particularly useful right now, but this is a generalized implementation for keeping HostResourceManager
in sync with engine. So this might find more uses in future, such as to keep periodic sync between engine and resource manager.
agent/engine/docker_task_engine.go
Outdated
// Consume host resources if task has progressed | ||
// Call to consume here should always succeed | ||
// Idempotent consume call | ||
if !task.IsInternal && taskStatus > apitaskstatus.TaskCreated { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to check for taskStatus == apitaskstatus.TaskCreated | TaskRunning
agent/engine/docker_task_engine.go
Outdated
waitingTaskQueueSingleLen := false | ||
engine.waitingTasksLock.Lock() | ||
waitingTaskQueueSingleLen = len(engine.waitingTaskQueue) == 1 | ||
engine.waitingTasksLock.Unlock() | ||
if waitingTaskQueueSingleLen { | ||
engine.monitorQueuedTaskEvent <- struct{}{} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite following this logic - I think it's sufficient to wake up the queue when we enqueue and when a task stops
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack - making it the channel buffered and a no-op/empty default
agent/engine/docker_task_engine.go
Outdated
break | ||
} | ||
} | ||
logger.Debug("No more tasks in Waiting Task Queue, waiting for new tasks") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit - no more tasks could be started at this moment
agent/engine/docker_task_engine.go
Outdated
consumable, err := engine.hostResourceManager.consumableSafe(taskHostResources) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit- redundant check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see another test or three covering the engine as an opaque box. You can add this as a follow up.
agent/engine/docker_task_engine.go
Outdated
resourcesToRelease := task.ToHostResources() | ||
err := engine.hostResourceManager.release(task.Arn, resourcesToRelease) | ||
if err != nil { | ||
logger.Critical("Failed to release resources after tast stopped", logger.Fields{field.TaskARN: task.Arn}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking: small typo here "tast"
880f7ba
to
a7c2340
Compare
// Always wakes up when at least one event arrives on buffered channel monitorQueuedTaskEvent | ||
// but does not block if monitorQueuedTasks is already processing queued tasks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we elaborate a little more in the comment on 1) when we will be invoking this method (who will be sending messages onto the channel), and 2) why is buffer size of one sufficient (why's it okay to drop any additional messages)
// Before starting managedTask goroutines, pre-allocate resources for already running | ||
// tasks in host resource manager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly "already running", more like tasks that have progressed beyond the resource consumption check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will update these in follow up PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both comments are on comments, gonna approve to unblock. Please update the comments in the next PR (I'm assuming we'll have a follow-up with more/updated tests)
* Revert "Revert "host resource manager initialization"" This reverts commit dafb967. * Revert "Revert "Add method to get host resources reserved for a task (#3706)"" This reverts commit 8d824db. * Revert "Revert "Add host resource manager methods (#3700)"" This reverts commit bec1303. * Revert "Revert "Remove task serialization and use host resource manager for task resources (#3723)"" This reverts commit cb54139. * Revert "Revert "add integ tests for task accounting (#3741)"" This reverts commit 61ad010. * Revert "Revert "Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order (#3747)"" This reverts commit 60a3f42. * Revert "Revert "Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue (#3750)"" This reverts commit 8943792.
* Revert reverted changes for task resource accounting (#3796) * Revert "Revert "host resource manager initialization"" This reverts commit dafb967. * Revert "Revert "Add method to get host resources reserved for a task (#3706)"" This reverts commit 8d824db. * Revert "Revert "Add host resource manager methods (#3700)"" This reverts commit bec1303. * Revert "Revert "Remove task serialization and use host resource manager for task resources (#3723)"" This reverts commit cb54139. * Revert "Revert "add integ tests for task accounting (#3741)"" This reverts commit 61ad010. * Revert "Revert "Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order (#3747)"" This reverts commit 60a3f42. * Revert "Revert "Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue (#3750)"" This reverts commit 8943792. * fix memory resource accounting for multiple containers in single task (#3782) * fix memory resource accounting for multiple containers * change unit tests for multiple containers, add unit test for awsvpc
Summary
ECS Agent ensures host resources to be available on the instance before running a task. Currently this is implemented through
serialization
- when scheduling a task, agent waits for all previously stopping tasks, i.e. withStopSequenceNumber
of payload from acs less thanseqnum
of payload of the requested task, to stop.This PR removes this serialization behavior and instead uses
HostResourceManager
to schedule tasks and a FIFO task queue built intodocker_task_engine
to queue tasks instead. This will hence use cpu, memory, ports(tcp/udp) and number of gpus available to manage tasks, and start progressing tasks as soon as resources for them start becoming available - instead of all stopping tasks to stop.Implementation details
Removes package
sequential_waitgroup
, and references related toStartSequenceNumber
andStopSequenceNumber
which are constructs related to task serializationTasks get queued in a
waitingTaskQueue
and wait for host resources (managed throughHostResourceManager
) to become available. A goroutinemonitorQueuedTasks
dequeues and starts waking up each of the waiting the tasks as and when resources start becoming available. When it can not dequeue anymore because resources are not available, it waitsWhen a task stops or when a new task arrives, it wakes up the
monitorQueuedTasks
in case it is blockedManagement of host resources When a task gets resources accounted for by the
monitorQueuedTasks
, resources are consumed. When a task changes it knownStatus toSTOPPED
and emits a change of state, resources are releasedFor Agent restarts, there is a
reconcileHostResources
implemented duringsynchronizeState
which synchronizesHostResourceManager
data structures according to known task states. If any container has been known to progressed beyondContainerStatusNone
state, then host resources are consumed.Related PRs
Related Containers Roadmap Issue
aws/containers-roadmap#325
Testing
Manually tested reconciliation behavior with agent restarts and verified resources are allocated correctly from agent logs
New tests cover the changes: Yes
TestTaskWaitForHostResources
unit test to test task queueing/dequeuingDescription for the changelog
Remove task serialization and use host resource manager for task resources
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.