Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address envFile resource naming defect #3554

Merged
merged 1 commit into from
Feb 6, 2023
Merged

Conversation

danehlim
Copy link
Contributor

@danehlim danehlim commented Jan 31, 2023

Summary

Address EnvironmentFileResource naming defect which causes a race condition when a task has multiple containers which specify at least one unique environment file.

For further context, please refer to the "Additional Context" section below in this pull request's description.

Implementation details

  • Modify EnvironmentFileResource GetName() method to easily differentiate EnvironmentFileResource objects from one another (i.e., add container name to return value since container name is unique)
  • Use resourcestatus.ResourceStatus(envFiles.EnvFileCreated) instead of resourcestatus.ResourceCreated when building a container's EnvironmentFileResource dependency in the initializeEnvfilesResource function
  • Update existing unit tests to account for new changes

Testing

Unit testing:

  • make test

Existing unit test TestInitializeAndGetEnvfilesResource has been expanded to cover the changes introduced in this pull request and operates as follows:

  1. Creates a task specifying two containers, with each container specifying its own single unique environment file
  2. Initializes each containers’ EnvironmentFileResource (which specifies an EnvironmentFileResource dependency for each container)
  3. Check that the specified resource dependency for each container is defined as expected
  4. Make sure that the resource dependency of the first container is not equal to the resource dependency of the second container

This test will always pass when the fix in this pull request is implemented since the two resource dependencies will never be equal due to each EnvironmentFileResource object’s call to its GetName() method returning a unique name. This test will also always fail under the current ECS Agent because the two resource dependencies will always be considered equal due to all calls to EnvironmentFileResource GetName() method returning the constant string “envfile”.

Manual testing:

An ECS customer who faced this issue was given a custom ECS Agent with the code changes from this PR built in and validated that they no longer ran into the issue after using that custom ECS Agent.

I was also able to reproduce the issue using the current ECS Agent by taking the following steps:

  1. defining a task which has 2 nginx containers where one container specifies a small (<20 KB) environment file and the other container specifies a large (>500 MB) environment file
  2. then trying to run 10 such tasks (and subsequently observing some are not able to start due to the race condition coming into play)

Upon using the custom ECS Agent mentioned previously, I could no longer reproduce the issue with the same steps.

New tests cover the changes: no (but existing TestInitializeAndGetEnvfilesResource test has been modified to cover the changes)

Additional Context

Per the Amazon ECS Developer Guide, each container in a given task can specify up to 10 environment files. If a container specifies 1 or more environment files, then all of its environment files are grouped into a single EnvironmentFileResource. So for each container with 1 or more environment files, an EnvironmentFileResource is created.

A problem arises in the current ECS Agent code when a task has multiple (i.e., more than one) containers which specify at least one unique environment file.

When EnvironmentFileResource objects are initialized in the initializeEnvfilesResource function, each EnvironmentFileResource (tied to a unique container in the task) is:

  1. added to the task’s ResourcesMap’s slice of task resource objects mapped from “envfile” key
  2. added as part of a ResourceDependency for the container it corresponds to, such that the container cannot be transitioned into a “CREATED” status until task resource with resource name “envfile” has KnownStatus “CREATED”

The 2nd point above is problematic. Every EnvironmentFileResource object’s GetName() method always returns “envfile”, which means all EnvironmentFileResource objects have the same name.

Then later, when we check if dependencies for a container are resolved, we pass the slice of task resources returned by mtask.GetResources() to the DependenciesAreResolved function.

Diving deeper into mtask.GetResources() reveals that it calls task.getResourcesUnsafe(), which iterates through the task’s resource map. According to the Go Programming Language Specification, “the iteration order over maps is not specified and is not guaranteed to be the same from one iteration to the next”. Thus, there is no guarantee as to how the task resources will be ordered in the slice of task resources that we pass to the DependenciesAreResolved function.

Recall that all EnvironmentFileResource objects have the same name. Because of this, observe that in the DependenciesAreResolved function, only the last EnvironmentFileResource encountered while iterating through the slice of task resources returned by the previous mtask.GetResources() call (this EnvironmentFileResource could correspond to any container in the task) will be the one we check if it has KnownStatus CREATED” yet, regardless of whether or not that EnvironmentFileResource is actually associated with the current container. That is, only the last EnvironmentFileResource encountered can occupy this map at key value “envfile”.

Thus, even though we have declared EnvironmentFileResource dependencies for containers, there is no guarantee that a container will try to transition to “CREATED” only after its associated environment files have finished downloading and been written to disk.

Simplified Example:

Let’s say we have a task that specifies 2 containers, container A and container B where each container specifies its own single unique environment file and the following sequence of events happens:

  1. Container A’s EnvironmentFileResource and Container B’s EnvironmentFileResource are initialized
  2. Container A’s environment file starts downloading
  3. Container B’s environment file starts downloading
  4. Container A’s environment file finishes downloading and has been written to disk
  5. Container A’s EnvironmentFileResource transitions to KnownStatus “CREATED” since all of its environment files have finished downloading
  6. There is a request to transition Container B to “CREATED”, but first Agent must check if Container B’s dependencies are resolved
    1. Get the task’s resources from task’s resource map and store them in a slice (ordering of how task resources get added to the slice NOT guaranteed)
    2. Iterate through the slice above, but only the last EnvironmentFileResource encountered is selected
    3. Let’s say it just so happens the last EnvironmentFileResource encountered is Container A’s EnvironmentFileResource
    4. Agent checks the selected EnvironmentFileResource to see if it has KnownStatus “CREATED”
    5. It sees it has KnownStatus “CREATED”, so the result of this check does not block Container B from transitioning to “CREATED”
  7. Agent begins transitioning Container B to “CREATED”
    1. Part of this transition includes reading from Container B’s environment file, which has NOT yet finished downloading yet
  8. “Unable to open environment file” error is thrown
  9. Container B’s environment file finishes downloading and has been written to disk, but it’s too late already

Description for the changelog

Address envFile resource naming defect

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@danehlim danehlim marked this pull request as ready for review January 31, 2023 22:56
@danehlim danehlim requested a review from a team as a code owner January 31, 2023 22:56
@danehlim danehlim marked this pull request as draft January 31, 2023 22:57
@danehlim danehlim marked this pull request as ready for review February 1, 2023 18:58
agent/api/task/task_test.go Show resolved Hide resolved
agent/api/task/task.go Show resolved Hide resolved
yinyic
yinyic previously approved these changes Feb 2, 2023
YashdalfTheGray
YashdalfTheGray previously approved these changes Feb 3, 2023
Copy link
Contributor

@YashdalfTheGray YashdalfTheGray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent writeup for the defect, helped me get a lot of context. Changes look good.

@danehlim
Copy link
Contributor Author

danehlim commented Feb 3, 2023

Previous push is just to rebase and apply newest commits from dev branch.

@danehlim
Copy link
Contributor Author

danehlim commented Feb 6, 2023

The following tests are known to currently hang on pending status.

ecs/al2kernel5dot10/integ_test
ecs/al2kernel5dot10arm/integ_test
ecs/al2kernel5dot10arm/unit_test

@Yiyuanzzz is working on a fix and it will be deployed in the near future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants