
Policies for ecs_service, ecs_cluster, ecs_taskdefinition, ecs_task #210

Closed · wants to merge 15 commits

Conversation

@alinabuzachis (Collaborator) commented May 13, 2022

Policies for ecs_service, ecs_cluster, ecs_taskdefinition, ecs_task

Needed for ansible-collections/community.aws#1145

@alinabuzachis force-pushed the ecs_policies branch 6 times, most recently from 888e6ac to def7593 on May 13, 2022
@alinabuzachis changed the title from "[WIP] Policies for ecs_service, ecs_cluster, ecs_taskdefinition, ecs_task" to "Policies for ecs_service, ecs_cluster, ecs_taskdefinition, ecs_task" on May 16, 2022
Comment on lines 697 to 699
self.client.delete_service(cluster=self.name, service=name['serviceName'], force=True)

self.client.delete_cluster(cluster=self.name)
Contributor:

Do we need a waiter here? https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ecs.html#ECS.Waiter.ServicesInactive
Otherwise delete_cluster might fail. Or does it not matter, because the next run of the terminator will delete the cluster once the service is gone?
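
For illustration, a minimal standalone sketch of how that boto3 waiter could be used; the cluster and service names here are hypothetical, and the PR's actual code runs inside a terminator class via self.client:

    import boto3

    client = boto3.client('ecs')
    cluster_name = 'example-cluster'    # hypothetical names for illustration
    service_name = 'example-service'

    client.delete_service(cluster=cluster_name, service=service_name, force=True)

    # Block until the service is fully gone; otherwise delete_cluster can
    # fail because the cluster still contains active services.
    waiter = client.get_waiter('services_inactive')
    waiter.wait(cluster=cluster_name, services=[service_name])

    client.delete_cluster(cluster=cluster_name)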

Collaborator:

We would typically avoid using waiters on the lambda to prevent timeouts, especially for anything that might take more than a minute or two. The ideal way to handle a scenario like this would be to create terminators for the cluster and any dependent resources, and increase the age limit on the cluster terminator. This is complicated by the fact that doing anything with services or container instances also requires the cluster name.

Comment on lines 688 to 690
tasks = _paginate_task_results(name['containerInstanceArn'])
for task in tasks:
    self.client.stop_task(cluster=self.name, task=task['taskArn'])
Contributor:

Unless the desired count of the owning service is set to 0, the service might spawn new tasks to replace the stopped ones.

Contributor:

Maybe flipping the order solves this issue in a lazy way (see the sketch after this list):

  1. delete the service
  2. delete all left running tasks (like killall -9)
  3. delete the cluster
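
A rough standalone sketch of that order, using hypothetical names (the real terminator code paginates through its own helpers):

    import boto3

    client = boto3.client('ecs')
    cluster = 'example-cluster'    # hypothetical name for illustration

    # 1. Delete the services first so they stop replacing stopped tasks.
    for page in client.get_paginator('list_services').paginate(cluster=cluster):
        for service_arn in page['serviceArns']:
            client.delete_service(cluster=cluster, service=service_arn, force=True)

    # 2. Stop whatever tasks are still running (the "killall -9" step).
    for page in client.get_paginator('list_tasks').paginate(cluster=cluster):
        for task_arn in page['taskArns']:
            client.stop_task(cluster=cluster, task=task_arn)

    # 3. Delete the now-empty cluster.
    client.delete_cluster(cluster=cluster)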

Collaborator:

Do the tasks need to be individually deleted at all? Won't deregistering the container instance also remove the tasks?

Contributor:

If the ECS tasks run on EC2, yes: deleting the instance also kills the ECS tasks. But if it's serverless (Fargate), there is no other way to stop/kill the tasks.

But basically you did it that way, if you won't wait for the service to become inactive.

IMO we've got two choices:
a) delete the service and wait until it's gone, then delete the cluster
b) delete the service, kill the pending tasks, then delete the cluster

Collaborator:

I feel like the first option seems like less work.

Collaborator:

We wouldn't want to ignore the exception forever though. So perhaps, one class with the default age_limit that catches and ignores ClusterContainsServicesException and ClusterContainsTasksException on delete_cluster(). Then a second class, which has an age_limit 10 minutes longer than the default, that only tries to delete clusters (and that one doesn't ignore those exceptions, so that errors still surface to us in the monitoring).

I'm assuming 10 minutes should be more than long enough, even on a bad day, for all tasks and services to be 100% gone.
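
A hypothetical sketch of that two-class split, assuming a DbTerminator-style base class like those used elsewhere in this repository (an overridable age_limit property, a terminate() method, and a boto3 client on self.client); the class names are illustrative, not the PR's actual code:

    import datetime

    class EcsCluster(DbTerminator):
        # Default age limit: try to delete, but tolerate clusters that still
        # contain draining services or tasks; a later run will retry.
        def terminate(self):
            try:
                self.client.delete_cluster(cluster=self.name)
            except (self.client.exceptions.ClusterContainsServicesException,
                    self.client.exceptions.ClusterContainsTasksException):
                pass    # assumed safe to leave for the next terminator run

    class EcsClusterFallback(DbTerminator):
        # Runs 10 minutes later and does NOT swallow those exceptions, so a
        # cluster that still cannot be deleted surfaces in the monitoring.
        @property
        def age_limit(self):
            return super().age_limit + datetime.timedelta(minutes=10)

        def terminate(self):
            self.client.delete_cluster(cluster=self.name)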

Contributor:

That sounds good!

Collaborator (Author):

@markuman I tried to create the second class. Can you please check it and let me know if this is what you meant? Thanks.

Contributor:

LGTM.
What about the failing CI? I cannot find any useful error message.

Collaborator (Author):

@markuman it needs to be refactored in some way because the file size exceeds the limit. I'll see what I can do about it.

@gravesm (Collaborator) commented Jun 7, 2022

Since we've exceeded the policy size on compute, I would suggest moving the ECS policies to the paas policy.

@alinabuzachis force-pushed the ecs_policies branch 2 times, most recently from 951959f to ee500ed on June 7, 2022
@alinabuzachis (Collaborator, Author) replied:

> Since we've exceeded the policy size on compute, I would suggest moving the ECS policies to the paas policy.

Done, thank you @gravesm!

self.client.stop_task(cluster=self.name, task=task['taskArn'])

# If there are running services, delete them first
services = _paginate_service_results()
Collaborator:

Services should be the first things that are deleted; otherwise, any deleted tasks may just be restarted again.

# Deregister container instances
for name in container_instances:
    self.client.deregister_container_instance(containerInstance=name['containerInstanceArn'])

Collaborator:

Before deleting the cluster, you will also need to delete all the tasks. In the case of fargate, there are no container instances and you may have tasks running that aren't managed by a service. I'm also not clear on whether using force=True when deleting a service also deletes the tasks or just orphans them. Either way, there should be a step that stops all tasks.
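
For illustration, a minimal sketch of such a stop-all-tasks step with hypothetical names; it also catches standalone Fargate tasks that have no container instance and no managing service:

    import boto3

    client = boto3.client('ecs')
    cluster = 'example-cluster'    # hypothetical name for illustration

    # Stop every running task in the cluster, whether it is service-managed,
    # standalone, EC2-backed, or Fargate.
    for page in client.get_paginator('list_tasks').paginate(cluster=cluster,
                                                            desiredStatus='RUNNING'):
        for task_arn in page['taskArns']:
            client.stop_task(cluster=cluster, task=task_arn,
                             reason='terminator cleanup')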

@gravesm (Collaborator) left a review:

A couple small fixes, but otherwise this LGTM.

if not names:
    return []

return client.describe_clusters(name=names)['clusters']
Collaborator:

Suggested change:
- return client.describe_clusters(name=names)['clusters']
+ return client.describe_clusters(clusters=names)['clusters']

Collaborator (Author):

Done! Thanks.


@tremble (Contributor) commented Feb 3, 2023

@alinabuzachis @gravesm Where do we stand on getting this merged?

@gravesm (Collaborator) commented Feb 3, 2023

Hey, I'll try to get to this early next week. There are a lot of parts covered by this PR and I'd like to set aside some time to test it.

@markuman (Contributor) commented Feb 5, 2023

Just a note, the ecs_cluster integration test is currently broken.

@gravesm (Collaborator) commented Feb 6, 2023

@tremble I'm running into a lot of problems with the ecs_cluster tests. There may be other things that are broken, as @markuman referred to, but the biggest problem I have so far is that the tests seem to be intended to provision containers on EC2 instances, but then the service definitions are using Fargate provisioning.

What I need in order to get this merged is a working test suite. I'm willing to work on getting the permissions correct, but I need a PR I can test against that is otherwise working.

@markuman (Contributor) commented Feb 6, 2023

> What I need in order to get this merged is a working test suite. I'm willing to work on getting the permissions correct, but I need a PR I can test against that is otherwise working.

I can try to tackle it this week.

@gravesm (Collaborator) commented Feb 23, 2023

Closing this as it was superseded by #264

@gravesm gravesm closed this Feb 23, 2023
softwarefactory-project-zuul bot pushed a commit to ansible-collections/community.aws that referenced this pull request on Mar 1, 2023:

ecs: integration test and new purge parameters

SUMMARY
- Make the ecs_cluster integration test work again.
- ecs_service: new parameters purge_placement_constraints and purge_placement_strategy; otherwise it is impossible to remove those placements without breaking backwards compatibility.
- Cover purge_placement_constraints in the integration test.
- Cover purge_placement_strategy in the integration test.

Required by mattclay/aws-terminator#210 (comment)

ISSUE TYPE
- Bugfix Pull Request
- Docs Pull Request
- Feature Pull Request

COMPONENT NAME
ecs_service

ADDITIONAL INFORMATION
Works for me again:

ansible-test integration --python 3.10 ecs_cluster --docker --allow-unsupported
...
PLAY RECAP *********************************************************************
testhost                   : ok=143  changed=69   unreachable=0    failed=0    skipped=1    rescued=0    ignored=6

Reviewed-by: Mark Chappell
Reviewed-by: Markus Bergholz <[email protected]>
Reviewed-by: Alina Buzachis
Reviewed-by: Mike Graves <[email protected]>
patchback bot pushed a commit to ansible-collections/community.aws that referenced this pull request on Mar 1, 2023: the same "ecs: integration test and new purge parameters" commit, cherry-picked from commit 86c60b4.
softwarefactory-project-zuul bot pushed a commit to ansible-collections/community.aws that referenced this pull request on Mar 1, 2023:

[PR #1716/86c60b49 backport][stable-5] ecs: integration test and new purge parameters

This is a backport of PR #1716 as merged into main (86c60b4); the commit message is the same as above.
abikouo pushed the same "ecs: integration test and new purge parameters" commit to abikouo/amazon.aws twice on Sep 18, 2023, referencing this pull request; the commit message is the same as above.