Host Resource Manager initialization #3684
Thank you Yiyuan! The PR looks good overall.
@@ -31,14 +32,15 @@ import (
 func NewTaskEngine(cfg *config.Config, client dockerapi.DockerClient,
 	credentialsManager credentials.Manager,
 	containerChangeEventStream *eventstream.EventStream,
-	imageManager ImageManager, state dockerstate.TaskEngineState,
+	imageManager ImageManager, hostResources map[string]*ecs.Resource, state dockerstate.TaskEngineState,
Should we initiate hostResourceManager outside of TaskEngine, and pass it to the engine constructor? TaskEngine needs a hostResourceManager object, but it doesn't (or shouldn't) need to know how it's created.
Similarly, task engine also consumes the other managers such as credentialsManager and imageManager directly.
Per offline discussion, we will keep it as it is for now.
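For reference, the alternative raised in this thread (building the manager outside the engine and injecting it) could look roughly like the sketch below. This is not what the PR does; the thread settled on keeping the current constructor. All types and names here are simplified, hypothetical stand-ins, not the agent's real definitions.

```go
package main

import "fmt"

// hostResourceManager is a hypothetical, simplified stand-in for the
// manager discussed in this PR.
type hostResourceManager struct {
	initial map[string]int64
}

// newHostResourceManager is the only place that knows how the manager is built.
func newHostResourceManager(initial map[string]int64) *hostResourceManager {
	return &hostResourceManager{initial: initial}
}

// taskEngine is a hypothetical stand-in for the task engine.
type taskEngine struct {
	hrm *hostResourceManager
}

// newTaskEngine receives an already-built manager, the same way the real
// constructor receives credentialsManager and imageManager.
func newTaskEngine(hrm *hostResourceManager) *taskEngine {
	return &taskEngine{hrm: hrm}
}

func main() {
	hrm := newHostResourceManager(map[string]int64{"CPU": 1024})
	engine := newTaskEngine(hrm)
	fmt.Println(engine.hrm.initial["CPU"])
}
```

With this shape, tests can hand the engine a manager built from any fixture (or a mock) without the engine constructor taking a raw resource map.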
hostResource map[string]*ecs.Resource
consumedResource map[string]*ecs.Resource

taskConsumed map[string]bool //task.arn to boolean whether host resources consumed or not
Nit - if we are going to be using this map as a set (i.e. we only ever care about whether the key exists), we can use map[string]struct{} to indicate that the value doesn't matter. But current implementation is fine too.
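A minimal sketch of the set idiom mentioned above, assuming the map is only ever queried for key presence. The `taskSet` type, its methods, and the ARN strings are hypothetical and not part of the PR.

```go
package main

import "fmt"

// taskSet is a hypothetical set of task ARNs. The empty struct value
// occupies no memory and signals that only key presence matters.
type taskSet map[string]struct{}

// add records that the task's host resources have been consumed.
func (s taskSet) add(arn string) { s[arn] = struct{}{} }

// has reports whether the task ARN is in the set.
func (s taskSet) has(arn string) bool {
	_, ok := s[arn]
	return ok
}

func main() {
	consumed := taskSet{}
	consumed.add("arn:aws:ecs:task/example") // placeholder ARN
	fmt.Println(consumed.has("arn:aws:ecs:task/example"))
	fmt.Println(consumed.has("arn:aws:ecs:task/other"))
}
```

The trade-off: `map[string]bool` lets callers write `if m[arn]` directly, while `map[string]struct{}` rules out the ambiguous "present but false" state.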
func NewHostResourceManager(resourceMap map[string]*ecs.Resource) HostResourceManager {
	consumedResourceMap := make(map[string]*ecs.Resource)
	taskConsumed := make(map[string]bool)
	// assigns CPU, MEMORY, PORTS, PORTS_UDP from host
Nit - is it possible to make PORTS -> PORTS_TCP just to be clear?
Sure, will change it

// HostResourceManager keeps account of each task in
type HostResourceManager struct {
	hostResource map[string]*ecs.Resource
Nit - maybe rename to initialHostResource if we plan to keep it immutable?
Also, can we add a comment for the variable to clarify that, for resources in resourceMap, some are "available resources" like CPU and memory, while others are "reserved/consumed resources" like ports, so that readers don't get confused about the inconsistency?
Thanks Yinyi, will update it!
@Yiyuanzzz Would it be clearer to instead rename hostResource to initialHostResource and keep the struct name as HostResourceManager, since consumedResource and taskConsumed will keep changing with state?
@@ -306,6 +306,18 @@ func (client *APIECSClient) getResources() ([]*ecs.Resource, error) {
 	return []*ecs.Resource{&cpuResource, &memResource, &portResource, &udpPortResource}, nil
 }

+func (client *APIECSClient) GetHostResources() (map[string]*ecs.Resource, error) {
nit. Could we add comments to this public function? See RegisterContainerInstance as an example.
Thanks, will update it
agent/app/agent.go
}
}

hostResources, err := client.GetHostResources()
Should we move lines 317-321 before lines 301-315 so that the agent fails fast if it cannot get host resources?
// Get host resources
hostResources, err := client.GetHostResources()
if err != nil {
	seelog.Critical("Unable to fetch host resources")
	return exitcodes.ExitError
}
// Get GPU devices
numGPUs := int64(0)
if agent.cfg.GPUSupportEnabled {
	err := agent.initializeGPUManager()
	if err != nil {
		seelog.Criticalf("Could not initialize Nvidia GPU Manager: %v", err)
		return exitcodes.ExitError
	}
	// Find number of GPUs instance has
	platformDevices := agent.getPlatformDevices()
	for _, device := range platformDevices {
		if *device.Type == ecs.PlatformDeviceTypeGpu {
			numGPUs++
		}
	}
}
// Update GPU in hostResources map
hostResources["GPU"] = &ecs.Resource{
	Name:         utils.Strptr("GPU"),
	Type:         utils.Strptr("INTEGER"),
	IntegerValue: &numGPUs,
}
Sounds good to me!
agent/app/agent.go
hostResources, err := client.GetHostResources()
if err != nil {
	seelog.Critical("Unable to fetch host resources")
	return exitcodes.ExitError
Discussed with @prateekchaudhry offline. We should revisit this exit code by referring to how the exit codes are defined:
- ExitError: (as well as unspecified exit codes) indicates a fatal error occurred, but the agent should be restarted
- ExitTerminal: indicates the agent has exited unsuccessfully, but should not be restarted
Discussed with @yinyic. We should probably use ExitError, because reading the host resources may fail for temporary reasons, in which case a retry will work.
@@ -78,6 +79,20 @@ var apiVersions = []dockerclient.DockerVersion{
 	dockerclient.Version_1_22,
 	dockerclient.Version_1_23}
 var capabilities []*ecs.Attribute
+var testHostCPU = int64(1024)
nit. Should they be constants instead of variables since we are not going to change them for different test cases?
If we make those constants, testHostCPU and testHostMEMORY are no longer addressable, so we cannot take their addresses in var testHostResource = map[string]*ecs.Resource{ .....IntegerValue: &testHostCPU,}
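The constraint described here is a Go language rule: `&` cannot be applied to a constant. One possible workaround, sketched below, is a small pointer helper in the spirit of the `utils.Strptr` call that appears elsewhere in this PR. The `resource` type, `int64Ptr`, `strPtr`, and `newTestResources` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// resource is a stand-in for ecs.Resource, reduced to the fields used here.
type resource struct {
	Name         *string
	IntegerValue *int64
}

// hostCPU is a constant, so the expression &hostCPU would not compile.
const hostCPU = int64(1024)

// int64Ptr copies v into a fresh variable and returns its address.
func int64Ptr(v int64) *int64 { return &v }

// strPtr does the same for strings.
func strPtr(s string) *string { return &s }

// newTestResources builds a fixture map from constants via the helpers.
func newTestResources() map[string]*resource {
	return map[string]*resource{
		"CPU": {Name: strPtr("CPU"), IntegerValue: int64Ptr(hostCPU)},
	}
}

func main() {
	r := newTestResources()
	fmt.Println(*r["CPU"].Name, *r["CPU"].IntegerValue)
}
```

Each call to `int64Ptr` returns a pointer to its own copy, so a test that mutates one fixture value cannot affect another. Keeping package-level variables, as the PR does, is equally valid.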
@@ -78,6 +79,20 @@ var apiVersions = []dockerclient.DockerVersion{
 	dockerclient.Version_1_22,
 	dockerclient.Version_1_23}
 var capabilities []*ecs.Attribute
+var testHostCPU = int64(1024)
+var testHostMEMORY = int64(1024)
+var testHostResource = map[string]*ecs.Resource{
Q. would we have any negative test case for "get host resources" in a follow-up PR?
I don't think so
// permissions and limitations under the License.

// Package engine contains the core logic for managing tasks

nit. Can we add more comments/contexts for this new manager? See agent/engine/docker_task_engine.go as an example. Thanks.
Thanks Chilinn, will add more comments
//PORTS
//Copying ports from host resources as consumed ports for initializing
ports := []*string{}
if resourceMap != nil && resourceMap["PORTS"] != nil {
Q. Could a nil resourceMap ever be passed here? If yes, should we return an error instead of creating a new manager? Or should we avoid using a nil resourceMap to initialize the manager?
Yes, a nil resourceMap will be passed here when engine.NewTaskEngine(....nil..) is called in several test files. For example, in agent/statemanager/state_manager_test.go:
taskEngine := engine.NewTaskEngine(&config.Config{}, nil, nil, nil, nil, nil, dockerstate.NewTaskEngineState(), nil, nil, nil, nil)
We added this logic to avoid nil pointer exceptions.
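As background for this thread: in Go, indexing a nil map is safe and yields the zero value, so the panic risk comes from dereferencing a missing entry rather than from the lookup itself. The sketch below illustrates that with a simplified stand-in for `ecs.Resource`; `portsOf` is a hypothetical helper, not code from the PR.

```go
package main

import "fmt"

// resource is a stand-in for ecs.Resource, reduced to one field.
type resource struct {
	StringSetValue []*string
}

// portsOf returns the PORTS entries, or an empty slice when the map is nil
// or has no PORTS key.
func portsOf(resourceMap map[string]*resource) []*string {
	ports := []*string{}
	// Indexing a nil map yields the zero value (a nil pointer here), so this
	// single check covers both a nil map and a missing key.
	if r := resourceMap["PORTS"]; r != nil {
		ports = append(ports, r.StringSetValue...)
	}
	return ports
}

func main() {
	fmt.Println(len(portsOf(nil)))
	p := "22"
	withPorts := map[string]*resource{"PORTS": {StringSetValue: []*string{&p}}}
	fmt.Println(len(portsOf(withPorts)))
}
```

Writing to a nil map does panic, so any code path that stores into the passed-in map would still need the explicit nil check.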
Summary
Initialize the host resource manager with the available host resource values. The host resource manager will keep track of the available host resources as tasks start and stop. Resources include host ports, CPU, memory, and GPU. New methods and tests will follow in later changes.
Implementation details
- host_resource_manager.go file in engine for host resource manager initialization.
- make gogenerate to generate mocks in api_mocks.go
- getResources private and added new GetHostResources method to interface.
- exitcodes.ExitError exit code when the agent is unable to fetch host resources. The agent will retry if it fails to get host resources. If it keeps failing, it will not be able to register the container instance, and customers will have some idle EC2 instances if they don't have a proper setup to terminate them. This is the same risk with or without the agent restart behavior, though.
Testing
Running the static checks within the PR.
make release-agent
New tests cover the changes: no
Description for the changelog
N/A
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.