pkg/destroy/aws: Use the resource-groups service for tag->ARN lookup #1039

Merged
merged 2 commits into from
Jan 22, 2019

Conversation

wking
Member

@wking wking commented Jan 10, 2019

Builds on #1038; review that first.

I tried resourcegroups as well, but it doesn't support IAM roles or hosted zones:

FATAL Failed to destroy cluster: {"ResourceTypeFilters":["AWS::EC2::Instance","AWS::EC2::InternetGateway","AWS::EC2::NatGateway","AWS::EC2::RouteTable","AWS::EC2::SecurityGroup","AWS::EC2::Subnet","AWS::EC2::VPC","AWS::EC2::VPNGateway","AWS::Route53::HostedZone","AWS::S3::Bucket"],"TagFilters":[{"Key":"openshiftClusterID","Values":["609cbd6b-af81-4278-9285-30314d973020"]}]}: BadRequestException: Query not valid: One or more resource types are not valid: AWS::Route53::HostedZone
                     status code: 400, request id: a1388303-14a7-11e9-b813-d172f51a0eff

So with this commit I've gone with resourcegroupstaggingapi. Ingress has been doing something similar since openshift/cluster-ingress-operator@8fdcf1a0 (openshift/cluster-ingress-operator#93), so we no longer need to worry about having insufficient permissions for this action.
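
For reference, here's roughly what that tag->ARN lookup looks like with resourcegroupstaggingapi (a minimal sketch, not the installer's exact code; the region is illustrative and the tag key/cluster ID are just the values from the error above):

// Minimal sketch: look up ARNs for everything carrying a given tag, paging
// through all results.  Region and tag filter values are placeholders.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	client := resourcegroupstaggingapi.New(sess)

	var arns []string
	err := client.GetResourcesPages(
		&resourcegroupstaggingapi.GetResourcesInput{
			TagFilters: []*resourcegroupstaggingapi.TagFilter{{
				Key:    aws.String("openshiftClusterID"),
				Values: []*string{aws.String("609cbd6b-af81-4278-9285-30314d973020")},
			}},
		},
		func(page *resourcegroupstaggingapi.GetResourcesOutput, lastPage bool) bool {
			for _, mapping := range page.ResourceTagMappingList {
				arns = append(arns, aws.StringValue(mapping.ResourceARN))
			}
			return !lastPage // keep fetching until the final page
		},
	)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, arn := range arns {
		fmt.Println(arn)
	}
}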

This will conflict with #1031 (and likely anything else touching AWS destruction), CC @dgoodwin.

For previous discussion of Route 53 quota issues see #957 and #981.

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 10, 2019
@dgoodwin
Contributor

Where does this leave #1031 ? Do I need to reimplement? Kind of terrible timing for that. :)

@dgoodwin
Contributor

CC @joelddiaz

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 14, 2019
@joelddiaz
Contributor

@wking one thing to look out for is the case where we delete an instance and the cluster starts creating new instances before we terminate the instance running the MCO. From a glance at the loop over tagClients, it wasn't clear to me that the code would double-check that there truly are no more instances to delete after it has gone through the initial results of the GetResourcesPages() call (FWIW, I'm not at all familiar with GetResourcesPages).

I remember getting strange hangs during uninstall because we were racing to delete instances faster than the cluster could re-create them (that's why deleteEC2Instance() was modified to also search for instances in the 'pending' state).
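
Something along these lines (a sketch, not the actual deleteEC2Instance() code; the tag filter and helper name are illustrative):

// Sketch: list cluster instances including ones still in 'pending', so freshly
// created replacements get terminated too.
package example

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

func clusterInstanceIDs(client ec2iface.EC2API, clusterTagKey string) ([]string, error) {
	var ids []string
	err := client.DescribeInstancesPages(
		&ec2.DescribeInstancesInput{
			Filters: []*ec2.Filter{
				{Name: aws.String("instance-state-name"), Values: aws.StringSlice([]string{"pending", "running"})},
				{Name: aws.String("tag-key"), Values: aws.StringSlice([]string{clusterTagKey})},
			},
		},
		func(page *ec2.DescribeInstancesOutput, lastPage bool) bool {
			for _, reservation := range page.Reservations {
				for _, instance := range reservation.Instances {
					ids = append(ids, aws.StringValue(instance.InstanceId))
				}
			}
			return !lastPage // walk every page of results
		},
	)
	return ids, err
}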

HTH

@wking
Member Author

wking commented Jan 15, 2019

From a glance at the loop over tagClients, it wasn't clear to me that the code would double-check...

That's what this FIXME is for ;).

@cgwalters
Member

one thing to look out for is the case where we delete an instance and the cluster starts creating new instances before we terminate the instance running the MCO.

You mean MAO i.e. machine-api-operator.

But yeah...shouldn't the destroy command try to go in and neuter the MAO in a cluster if it's running?

@wking
Member Author

wking commented Jan 18, 2019

But yeah...shouldn't the destroy command try to go in and neuter the MAO in a cluster if it's running?

It could. And it could also try to scale MachineSets to zero replicas, disable autoscalers, remove routes, etc. But we need deletion to be robust even if the cluster is borked, and once we have that, these sorts of gentle wind-down approaches seem unnecessary. Patches are welcome, though, if it bothers folks enough that they don't want to wait until we have time to add them. Until then, the destroy logic just has to kill the cluster faster than it can self-heal ;).

@wking wking force-pushed the tag-search-destroy branch from 1258fd5 to eb5d7e0 Compare January 19, 2019 07:19
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 19, 2019
@wking wking changed the title WIP: pkg/destroy/aws: Use the resource-groups service for tag->ARN lookup pkg/destroy/aws: Use the resource-groups service for tag->ARN lookup Jan 19, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 19, 2019
@wking
Member Author

wking commented Jan 19, 2019

Until then, the destroy logic just has to kill the cluster faster than it can self-heal ;).

Thinking this over some more, how quickly is the new credentials operator noticing and recovering from user deletion? If this destroy script goes through and deletes the operator-created users early in the process, the cluster-API providers will no longer be authorized to create new machines, the ingress operator will no longer be authorized to create load balancers, etc. We should be able to freeze self-management by removing those users. And if the credentials operator makes itself a new, per-cluster user for creating sub-users, then this destroy code looping through and deleting per-cluster IAM users would definitely freeze everything. Thoughts, @dgoodwin?

@wking wking force-pushed the tag-search-destroy branch 3 times, most recently from 8fc1af0 to 5a2809a Compare January 19, 2019 10:56
@wking
Member Author

wking commented Jan 19, 2019

Logs from the successful teardown, as seen by users at the default info log level:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1039/pull-ci-openshift-installer-master-e2e-aws/3043/artifacts/e2e-aws/container-logs/teardown.log.gz | gunzip | grep -A100 'Deprovisioning'
Deprovisioning cluster ...
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-08d54d07c6377bb60" id=rtbassoc-0f10b91d5820124c7
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-08d54d07c6377bb60" id=rtb-08d54d07c6377bb60
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-07606cb62583edfdf" id=rtbassoc-004589319449dabae
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-07606cb62583edfdf" id=rtb-07606cb62583edfdf
level=info msg=Disassociated IAM instance profile="arn:aws:iam::460538899914:instance-profile/ci-op-w01cqvnw-1d3f3-master-profile" arn="arn:aws:ec2:us-east-1:460538899914:instance/i-07f7b16c957a863ae" id=i-07f7b16c957a863ae name=ci-op-w01cqvnw-1d3f3-master-profile role=ci-op-w01cqvnw-1d3f3-master-role
level=info msg=Deleted IAM instance profile="arn:aws:iam::460538899914:instance-profile/ci-op-w01cqvnw-1d3f3-master-profile" arn="arn:aws:ec2:us-east-1:460538899914:instance/i-07f7b16c957a863ae" id=i-07f7b16c957a863ae name=ci-op-w01cqvnw-1d3f3-master-profile
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-07f7b16c957a863ae" id=i-07f7b16c957a863ae
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f15e4df9edf7e7b3" id=rtbassoc-03063b5a8fba5ad73
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f15e4df9edf7e7b3" id=rtb-0f15e4df9edf7e7b3
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f406bcd82b0b270a" id=rtbassoc-0706552e38d310167
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f406bcd82b0b270a" id=rtbassoc-0fafd82b0ae07e1d7
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f406bcd82b0b270a" id=rtbassoc-07759e5b53ac430b8
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f406bcd82b0b270a" id=rtbassoc-01c0edee4bfdbf646
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f406bcd82b0b270a" id=rtbassoc-04d2f35954cec0880
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f406bcd82b0b270a" id=rtbassoc-0d6b9e341743cc5e7
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-05333e0adc4b369d5" id=i-05333e0adc4b369d5
level=info msg=Disassociated IAM instance profile="arn:aws:iam::460538899914:instance-profile/ci-op-w01cqvnw-1d3f3-worker-profile" arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0b8aef30b2b85a2c9" id=i-0b8aef30b2b85a2c9 name=ci-op-w01cqvnw-1d3f3-worker-profile role=ci-op-w01cqvnw-1d3f3-worker-role
level=info msg=Deleted IAM instance profile="arn:aws:iam::460538899914:instance-profile/ci-op-w01cqvnw-1d3f3-worker-profile" arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0b8aef30b2b85a2c9" id=i-0b8aef30b2b85a2c9 name=ci-op-w01cqvnw-1d3f3-worker-profile
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0b8aef30b2b85a2c9" id=i-0b8aef30b2b85a2c9
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0ef7376ee975d7c12" id=rtbassoc-01b934575f14b766b
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0ef7376ee975d7c12" id=rtb-0ef7376ee975d7c12
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-05487001a07a5b5be" id=i-05487001a07a5b5be
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 public zone=/hostedzone/Z328TAU66YDJH7 record set="A ci-op-w01cqvnw-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 record set="A ci-op-w01cqvnw-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 record set="A ci-op-w01cqvnw-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 record set="A ci-op-w01cqvnw-1d3f3-etcd-1.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 record set="A ci-op-w01cqvnw-1d3f3-etcd-2.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 record set="SRV _etcd-server-ssl._tcp.ci-op-w01cqvnw-1d3f3.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 public zone=/hostedzone/Z328TAU66YDJH7 record set="A \\052.apps.ci-op-w01cqvnw-1d3f3.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 record set="A \\052.apps.ci-op-w01cqvnw-1d3f3.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0777455b0434571a8" id=i-0777455b0434571a8
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" classic load balancer=aebc687541bdc11e9a1e20e71c77f5cc id=vpc-08956078fafcbdec4
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4 load balancer=loadbalancer/net/ci-op-w01cqvnw-1d3f3-int/5b495b87ac6b6e52
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4 load balancer=loadbalancer/net/ci-op-w01cqvnw-1d3f3-ext/ce6607e077507783
level=info msg=Deleted VPC endpoint=vpce-045b9ea5642409aff arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4 table=rtb-05d72d70d343560ce
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=rtbassoc-05e90193769663473 table=rtb-07fd0d17bf60c08f7
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4 table=rtb-07fd0d17bf60c08f7
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=rtbassoc-010d76b6ab4efe1e6 table=rtb-0e73c4635365e8141
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4 table=rtb-0e73c4635365e8141
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-0c553c581c0c05939" id=subnet-0c553c581c0c05939
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-06416a72fbd020c37" id=subnet-06416a72fbd020c37
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:natgateway/nat-00a6de3c0f31ba2ea" id=nat-00a6de3c0f31ba2ea
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:natgateway/nat-0317ccd8c1bee5b7b" id=nat-0317ccd8c1bee5b7b
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:natgateway/nat-0362f03c7b460dfc8" id=nat-0362f03c7b460dfc8
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:natgateway/nat-089a90316703bf8cd" id=nat-089a90316703bf8cd
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:natgateway/nat-0c22fa1228a07ce20" id=nat-0c22fa1228a07ce20
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:natgateway/nat-0c442b564a8c666d2" id=nat-0c442b564a8c666d2
level=info msg=Deleted arn="arn:aws:elasticloadbalancing:us-east-1:460538899914:targetgroup/ci-op-w01cqvnw-1d3f3-api-ext/8b8380f21e21a8fe" id=ci-op-w01cqvnw-1d3f3-api-ext/8b8380f21e21a8fe
level=info msg=Deleted arn="arn:aws:elasticloadbalancing:us-east-1:460538899914:targetgroup/ci-op-w01cqvnw-1d3f3-api-int/2cbd81986eb4d9d6" id=ci-op-w01cqvnw-1d3f3-api-int/2cbd81986eb4d9d6
level=info msg=Deleted arn="arn:aws:elasticloadbalancing:us-east-1:460538899914:targetgroup/ci-op-w01cqvnw-1d3f3-services/5936e74f37b60521" id=ci-op-w01cqvnw-1d3f3-services/5936e74f37b60521
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0df56db9b6ea4838d" id=i-0df56db9b6ea4838d
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-0e6baced9303860a6" id=subnet-0e6baced9303860a6
level=info msg=Deleted arn="arn:aws:s3:::image-registry-us-east-1-55206167869847a997638fcd23075f5f-ecc7"
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-066bb64da56410b9d" id=sg-066bb64da56410b9d
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-0ec4ea796679549cc" id=sg-0ec4ea796679549cc
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-036b0d0a136290d41" id=sg-036b0d0a136290d41
level=info msg=Deleted arn="arn:aws:iam::460538899914:role/ci-op-w01cqvnw-1d3f3-master-role" id=ci-op-w01cqvnw-1d3f3-master-role name=ci-op-w01cqvnw-1d3f3-master-role policy=ci-op-w01cqvnw-1d3f3_master_policy
level=info msg=Deleted arn="arn:aws:iam::460538899914:role/ci-op-w01cqvnw-1d3f3-master-role" id=ci-op-w01cqvnw-1d3f3-master-role name=ci-op-w01cqvnw-1d3f3-master-role
level=info msg=Deleted arn="arn:aws:iam::460538899914:role/ci-op-w01cqvnw-1d3f3-worker-role" id=ci-op-w01cqvnw-1d3f3-worker-role name=ci-op-w01cqvnw-1d3f3-worker-role policy=ci-op-w01cqvnw-1d3f3_worker_policy
level=info msg=Deleted arn="arn:aws:iam::460538899914:role/ci-op-w01cqvnw-1d3f3-worker-role" id=ci-op-w01cqvnw-1d3f3-worker-role name=ci-op-w01cqvnw-1d3f3-worker-role
level=info msg=Deleted arn="arn:aws:iam::460538899914:user/cloud-credential-operator-iam-ro-fwkzb" id=cloud-credential-operator-iam-ro-fwkzb policy=cloud-credential-operator-iam-ro-fwkzb-policy
level=info msg=Deleted arn="arn:aws:iam::460538899914:user/cloud-credential-operator-iam-ro-fwkzb" id=cloud-credential-operator-iam-ro-fwkzb
level=info msg=Deleted arn="arn:aws:iam::460538899914:user/openshift-image-registry-q6p2t" id=openshift-image-registry-q6p2t policy=openshift-image-registry-q6p2t-policy
level=info msg=Deleted arn="arn:aws:iam::460538899914:user/openshift-image-registry-q6p2t" id=openshift-image-registry-q6p2t
level=info msg=Deleted arn="arn:aws:iam::460538899914:user/openshift-ingress-mc9xp" id=openshift-ingress-mc9xp policy=openshift-ingress-mc9xp-policy
level=info msg=Deleted arn="arn:aws:iam::460538899914:user/openshift-ingress-mc9xp" id=openshift-ingress-mc9xp
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:volume/vol-0a62e6cc8a74478dd" id=vol-0a62e6cc8a74478dd
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:volume/vol-059083b04f4e4c2c7" id=vol-059083b04f4e4c2c7
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:volume/vol-05b1b3e362292a67b" id=vol-05b1b3e362292a67b
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:volume/vol-0d859f7e6e4690e39" id=vol-0d859f7e6e4690e39
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-0062a468064709bfd" id=subnet-0062a468064709bfd
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-0debe524c9af7eee1" id=subnet-0debe524c9af7eee1
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:volume/vol-028bb2f0fae8682dd" id=vol-028bb2f0fae8682dd
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:volume/vol-065c16f902036b5ab" id=vol-065c16f902036b5ab
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-05121239b7ccf1425" id=sg-05121239b7ccf1425
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-0d90b97dc805e942d" id=subnet-0d90b97dc805e942d
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-09743a51024b9ecbf" id=sg-09743a51024b9ecbf
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-07f0d2123e49069b4" id=subnet-07f0d2123e49069b4
level=info msg=Released arn="arn:aws:ec2:us-east-1:460538899914:elastic-ip/eipalloc-002b0ca2bbaca69c9" id=eipalloc-002b0ca2bbaca69c9
level=info msg=Released arn="arn:aws:ec2:us-east-1:460538899914:elastic-ip/eipalloc-0549c2421d18e7d69" id=eipalloc-0549c2421d18e7d69
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-0bcff76420688dc8f" id=subnet-0bcff76420688dc8f
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-0c26b8f2a4bc4d8de" id=subnet-0c26b8f2a4bc4d8de
level=info msg=Released arn="arn:aws:ec2:us-east-1:460538899914:elastic-ip/eipalloc-006407445dbc6c7ea" id=eipalloc-006407445dbc6c7ea
level=info msg=Released arn="arn:aws:ec2:us-east-1:460538899914:elastic-ip/eipalloc-06c27f76c5cd83682" id=eipalloc-06c27f76c5cd83682
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-04dfd603fb361c629" id=subnet-04dfd603fb361c629
level=info msg=Released arn="arn:aws:ec2:us-east-1:460538899914:elastic-ip/eipalloc-043b512ecdd28069f" id=eipalloc-043b512ecdd28069f
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-03c7e4fff64d4bc97" id=subnet-03c7e4fff64d4bc97
level=info msg=Released arn="arn:aws:ec2:us-east-1:460538899914:elastic-ip/eipalloc-0e4b9bee28d53ebbc" id=eipalloc-0e4b9bee28d53ebbc
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:internet-gateway/igw-0ca0fe079f943bce7" id=igw-0ca0fe079f943bce7
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-062b01bed3f5bee74" id=subnet-062b01bed3f5bee74
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-0e70a157197eac76f" id=sg-0e70a157197eac76f
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4

From the timestamped .openshift_install.log, destruction started at 11:58:46, was down to the VPC, a security group, and some subnets by 12:00:00, and then wrapped up with:

time="2019-01-19T12:00:14Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-062b01bed3f5bee74" id=subnet-062b01bed3f5bee74
time="2019-01-19T12:00:14Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/ci-op-w01cqvnw-1d3f3\":\"owned\"}"
time="2019-01-19T12:00:14Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-0e70a157197eac76f" id=sg-0e70a157197eac76f
time="2019-01-19T12:00:15Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4

I expect the delay was while AWS slowly tore down terminated instances or NAT gateways, or one of its other slow-to-delete resource types. Overall, that comes to an 89-second teardown. For comparison, this run (which uses the existing teardown logic from master) ran from 12:11:53 through 12:14:13, taking 140 seconds. Both of those runs are in a quiet period for the CI account and just after I'd gone through and cleaned up most leaked resources. A random run from a busy, crufty period ran from 06:13:47 through 06:16:55, taking 188 seconds. So I suspect this is going to be faster and more robust than the current master, but we're still in the same ballpark.

}
}

return lastPage
Member Author

This line (and a number of similar lines throughout this package) was buggy. I'd added the first instance of this bug in openshift/hive@d4b97563, but the docs say "To stop iterating, return false from the fn function." So this pull request fixes the buggy, fetch-only-the-first-page handling for these resource types:

$ git cat-file -p b8538961a8de^:pkg/destroy/aws/aws.go | grep 'Pages(\|lastPage'
	if err := elbv2Client.DescribeLoadBalancersPages(&elbv2.DescribeLoadBalancersInput{}, func(results *elbv2.DescribeLoadBalancersOutput, lastPage bool) bool {
		return lastPage
	if err := elbv2Client.DescribeTargetGroupsPages(&elbv2.DescribeTargetGroupsInput{}, func(results *elbv2.DescribeTargetGroupsOutput, lastPage bool) bool {
		return lastPage
	err := ec2Client.DescribeInstancesPages(&describeInstancesInput, func(results *ec2.DescribeInstancesOutput, lastPage bool) bool {
		return lastPage
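
For comparison, roughly the shape of the fix (a sketch against an assumed elbv2 client; the real code also acts on the results rather than just collecting them):

// Sketch: collect every load balancer by paging until the SDK reports the last
// page.  lastPage is false on all but the final page, so returning it stopped
// the iteration after page one whenever more pages remained.
package example

import (
	"github.com/aws/aws-sdk-go/service/elbv2"
	"github.com/aws/aws-sdk-go/service/elbv2/elbv2iface"
)

func listAllLoadBalancers(client elbv2iface.ELBV2API) ([]*elbv2.LoadBalancer, error) {
	var loadBalancers []*elbv2.LoadBalancer
	err := client.DescribeLoadBalancersPages(
		&elbv2.DescribeLoadBalancersInput{},
		func(results *elbv2.DescribeLoadBalancersOutput, lastPage bool) bool {
			loadBalancers = append(loadBalancers, results.LoadBalancers...)
			return !lastPage // the buggy version returned lastPage here
		},
	)
	return loadBalancers, err
}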

Member Author

This pull request adds pagination for a number of resources that were missing pagination before (and even when we called *Pages() before, we were doing it wrong):

$ git grep 'Pages(' pkg/destroy/aws/aws.go
pkg/destroy/aws/aws.go:				err = tagClient.GetResourcesPages(
pkg/destroy/aws/aws.go:	err2 := search.client.ListRolesPages(
pkg/destroy/aws/aws.go:	err2 := search.client.ListUsersPages(
pkg/destroy/aws/aws.go:	err2 := client.DescribeRouteTablesPages(
pkg/destroy/aws/aws.go:	err2 := client.DescribeLoadBalancersPages(
pkg/destroy/aws/aws.go:	err2 := client.DescribeTargetGroupsPages(
pkg/destroy/aws/aws.go:	err2 := client.DescribeLoadBalancersPages(
pkg/destroy/aws/aws.go:	err2 := client.ListRolePoliciesPages(
pkg/destroy/aws/aws.go:	err2 = client.ListInstanceProfilesForRolePages(
pkg/destroy/aws/aws.go:	err2 := client.ListUserPoliciesPages(
pkg/destroy/aws/aws.go:	err2 = client.ListAccessKeysPages(
pkg/destroy/aws/aws.go:	err = client.ListResourceRecordSetsPages(
pkg/destroy/aws/aws.go:	err2 := client.ListResourceRecordSetsPages(

It's unlikely that a single role will have pages of policies, but there could certainly be pages of private zones. This non-paginated line would have led to deeper zones being ignored, which may be the reason we'd sometimes leak private zones. That in turn would have led to leaking records in the public zone. And both of those leaks would require future destroy runs to grind through more Route 53 entries looking for things to delete, which was hurting our Route 53 API rate limits.
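
For illustration, a paginated private-zone lookup might look like this (a sketch; the matching predicate is an assumption, not the installer's exact logic):

// Sketch: walk every page of hosted zones looking for the cluster's private
// zone, instead of only the first page.
package example

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/route53"
	"github.com/aws/aws-sdk-go/service/route53/route53iface"
)

func findPrivateZone(client route53iface.Route53API, domain string) (*route53.HostedZone, error) {
	var match *route53.HostedZone
	err := client.ListHostedZonesPages(
		&route53.ListHostedZonesInput{},
		func(page *route53.ListHostedZonesOutput, lastPage bool) bool {
			for _, zone := range page.HostedZones {
				private := zone.Config != nil && aws.BoolValue(zone.Config.PrivateZone)
				if private && strings.TrimSuffix(aws.StringValue(zone.Name), ".") == domain {
					match = zone
					return false // found it; stop paging
				}
			}
			return !lastPage // otherwise keep walking the remaining pages
		},
	)
	return match, err
}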

Member Author

Oops, I'd meant to comment on this line for the unpaginated-zones issue.

@wking wking force-pushed the tag-search-destroy branch from 5a2809a to 3a06697 Compare January 19, 2019 12:59
@wking
Copy link
Member Author

wking commented Jan 21, 2019

e2e-aws hung:

2019/01/19 13:34:18 Container setup in pod e2e-aws completed successfully
2019/01/19 16:17:06 Copying artifacts from e2e-aws into /logs/artifacts/e2e-aws

/retest

@dgoodwin
Contributor

Thinking this over some more, how quickly is the new credentials operator noticing and recovering from user deletion? If this destroy script goes through and deletes the operator-created users early in the process, the cluster-API providers will no longer be authorized to create new machines, the ingress operator will no longer be authorized to create load balancers, etc. We should be able to freeze self-management by removing those users. And if the credentials operator makes itself a new, per-cluster user for creating sub-users, then this destroy code looping through and deleting per-cluster IAM users would definitely freeze everything. Thoughts, @dgoodwin?

The cred operator won't react immediately to a user deletion in AWS, so it would be picked up on the next Kube cycle, 10 minutes max, possibly going up to 30, last I heard. Could still get unlucky, but it would help. Unfortunately the credential for creating sub-users has to be given to us by the user, and is not a cred we could delete. If you were to go into etcd and wipe out the CredentialsRequests, that would help, but we've been hesitant to get into that and have hoped that deprovision would just be the same whether the cluster was live or dead.

@wking
Member Author

wking commented Jan 21, 2019

The cred operator won't react immediately to a user deletion in AWS, so it would be picked up on the next Kube cycle, 10 minutes max, possibly going up to 30, last I heard. Could still get unlucky, but it would help. Unfortunately the credential for creating sub-users has to be given to us by the user, and is not a cred we could delete.

That seems fine to me, especially since we're looping over both machines and credentials as this PR stands. It does mean that we don't want to rely on just a credential-deleting loop to block future machine creation.

wking added a commit to wking/openshift-installer that referenced this pull request Jan 25, 2019
Fixing a bug from e24c7dc (pkg/destroy/aws: Use the resource-groups
service for tag->ARN lookup, 2019-01-10, openshift#1039).  Before this commit,
we were setting loopError, so we'd still take another pass through the
loop.  But we weren't setting 'matched', because fetch errors
(e.g. because the caller lacked tag:GetResources) would not have
returned *any* resources.  The deletion code would wrongly assume that
there were no matching resources behind that tagClient and remove the
client from tagClients.  'Run' would end up exiting non-zero despite
having abandoned the resources behind that tagClient.

With this commit, we no longer prune the tagClient.  And since we
don't distinguish between fatal and non-fatal errors, we'll just loop
forever until the caller notices the problem and kills us.  That's not
great, but with permission pre-checks in the pipe via install-time
credential operator calls, I don't know if it's worth putting in a
fatal/nonfatal distinction now.
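
A sketch of the corrected pruning behavior described above (the helper shape and names are assumptions for illustration; the point is keep-on-error, prune-only-when-empty):

// Sketch: only drop a tag client when a *successful* GetResources call comes
// back with no matching resources; on fetch errors keep the client so a later
// pass can retry it.  Deletion of the matched ARNs is elided.
package example

import (
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi"
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi/resourcegroupstaggingapiiface"
)

func pruneTagClients(
	tagClients []resourcegroupstaggingapiiface.ResourceGroupsTaggingAPIAPI,
	filters []*resourcegroupstaggingapi.TagFilter,
) ([]resourcegroupstaggingapiiface.ResourceGroupsTaggingAPIAPI, error) {
	var lastError error
	nextClients := make([]resourcegroupstaggingapiiface.ResourceGroupsTaggingAPIAPI, 0, len(tagClients))
	for _, tagClient := range tagClients {
		matched := false
		err := tagClient.GetResourcesPages(
			&resourcegroupstaggingapi.GetResourcesInput{TagFilters: filters},
			func(results *resourcegroupstaggingapi.GetResourcesOutput, lastPage bool) bool {
				if len(results.ResourceTagMappingList) > 0 {
					matched = true // something is still tagged; keep this client around
				}
				return !lastPage
			},
		)
		if err != nil || matched {
			// On error we learned nothing, so we must not prune the client.
			nextClients = append(nextClients, tagClient)
			if err != nil {
				lastError = err
			}
		}
	}
	return nextClients, lastError
}
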
wking added a commit to wking/openshift-installer that referenced this pull request Mar 18, 2019
Fixing some bugs from e24c7dc (pkg/destroy/aws: Use the
resource-groups service for tag->ARN lookup, 2019-01-10, openshift#1039), where
I'd meant to update the delete-function-level lastError, but had
instead been creating a new local variable inside the *Pages helper.
wking added a commit to wking/openshift-installer that referenced this pull request Mar 18, 2019
We've caught this on DeleteSecurityGroup since e24c7dc
(pkg/destroy/aws: Use the resource-groups service for tag->ARN lookup,
2019-01-10, openshift#1039), but until now we didn't handle the more-likely
case where the security group in question had been removed before the
DescribeSecurityGroups call.
wking added a commit that referenced this pull request Oct 11, 2019
Remove any 'kubernetes.io/cluster/{id}: shared' tags, such as those
that we'll use for bring-your-own subnet workflows.  The 20-tag limit
that leads to the loop is from [1].  We're unlikely to exceed 20 with
just the subnets, but it seemed reasonable to plan for the future to
avoid surprises if we grow this list going forward.

Including tagClients allows us to find and remove tags from Route 53
resources, which live in us-east-1 regardless of the rest of the
cluster's region.  More on that under "The us-east-1 business..." in
e24c7dc (pkg/destroy/aws: Use the resource-groups service for
tag->ARN lookup, 2019-01-10, #1039).  I have not added untag support
for IAM resources (which are not supported by
resourcegroupstaggingapi, more on that in e24c7dc too), but we
could add untagging for them later if we want to support
explicitly-tagged shared IAM resources.

The append([]T(nil), a...) business is a slice copy [2], so we can
drain the stack in each of our loops while still having a full
tagClients slice to feed into the next loop.

I'm calculating the shared tag by looking for existing
'kubernetes.io/cluster/{id}: owned' tags.  In most cases there will be
only one, but the metadata structure theoretically allows folks to
pass in multiple such tags with different IDs, or similar tags with
values other than 'owned'.  The logic I have here will remove 'shared'
forms of any 'owned' tags, and will warn for non-owned tags just in
case the user expects us to be removing 'shared' in those cases where
we will not.  But I'm continuing on without returning an error,
because we want 'destroy cluster' to be pretty robust.

I'm also logging (with an info-level log) errors with retrieving tags
or untagging, but again continuing on without erroring out in
those cases.  Hopefully those errors are ephemeral and a future
attempt will punch through and succeed.  In future work, if these
errors are common enough, we might consider distinguishing between
error codes like the always-fatal ErrCodeInvalidParameterException and
the ephemeral ErrCodeThrottledException and aborting on always-fatal
errors.

The 'removed' tracker guards against redundant untag attempts in the
face of AWS eventual consistency (i.e. subsequent GetResources calls
returning tags which had been removed by earlier UntagResources
calls).  But there is a risk that we miss removing a 'shared' tag that
a parallel process re-injects while we're running:

1. We detect a shared tag on resource A.
2. We remove the tag from resource A.
3. Someone else adds it back.
4. We detect the restored tag on resource A, but interpret it as an
   eventual-consistency thing and leave it alone.

I inserted removeSharedTags at the end of the infrastructure removal
to reflect creation where we tag shared before creating other
infrastructure.  This effectively removes our reservations once we no
longer have any owned resources which could be consuming the shared
resources.  This might cause us to leak shared tags in cases where
'destroy cluster' is killed after infrastructure removal but before
shared-tag removal completed.  But there's not much you can do to
backstop that short of looking for orphaned 'shared' tags and removing
them via some custom reaper logic.

[1]: https://docs.aws.amazon.com/sdk-for-go/api/service/resourcegroupstaggingapi/#UntagResourcesInput
[2]: https://github.com/golang/go/wiki/SliceTricks#copy
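
A sketch of the batched untagging described above (simplified to a single tag key; the batch size of 20 matches the ResourceARNList limit in the UntagResourcesInput docs [1]):

// Sketch: remove one tag key from a list of ARNs, draining the list in
// batches because UntagResources accepts at most 20 ARNs per call.
package example

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi"
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi/resourcegroupstaggingapiiface"
)

func removeSharedTag(client resourcegroupstaggingapiiface.ResourceGroupsTaggingAPIAPI, arns []string, tagKey string) error {
	for len(arns) > 0 {
		batch := arns
		if len(batch) > 20 {
			batch = arns[:20]
		}
		arns = arns[len(batch):]
		_, err := client.UntagResources(&resourcegroupstaggingapi.UntagResourcesInput{
			ResourceARNList: aws.StringSlice(batch),
			TagKeys:         []*string{aws.String(tagKey)},
		})
		if err != nil {
			return err
		}
	}
	return nil
}
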
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019