pkg/destroy/aws: Use the resource-groups service for tag->ARN lookup #1039

Merged
merged 2 commits into from
Jan 22, 2019

Conversation

wking
Member

@wking wking commented Jan 10, 2019

Builds on #1038; review that first.

I tried resourcegroups as well, but it doesn't support IAM roles or hosted zones:

FATAL Failed to destroy cluster: {"ResourceTypeFilters":["AWS::EC2::Instance","AWS::EC2::InternetGateway","AWS::EC2::NatGateway","AWS::EC2::RouteTable","AWS::EC2::SecurityGroup","AWS::EC2::Subnet","AWS::EC2::VPC","AWS::EC2::VPNGateway","AWS::Route53::HostedZone","AWS::S3::Bucket"],"TagFilters":[{"Key":"openshiftClusterID","Values":["609cbd6b-af81-4278-9285-30314d973020"]}]}: BadRequestException: Query not valid: One or more resource types are not valid: AWS::Route53::HostedZone
                     status code: 400, request id: a1388303-14a7-11e9-b813-d172f51a0eff

So with this commit I've gone with resourcegroupstaggingapi. Ingress has been doing something similar since openshift/cluster-ingress-operator@8fdcf1a0 (openshift/cluster-ingress-operator#93), so we no longer need to worry about having insufficient permissions for this action.
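
For reference, here's roughly what that tag->ARN lookup looks like with resourcegroupstaggingapi (a minimal sketch, not the installer's exact code; the region is illustrative and the tag key/cluster ID are just the values from the error above):

// Minimal sketch: look up ARNs for everything carrying a given tag, paging
// through all results.  Region and tag filter values are placeholders.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	client := resourcegroupstaggingapi.New(sess)

	var arns []string
	err := client.GetResourcesPages(
		&resourcegroupstaggingapi.GetResourcesInput{
			TagFilters: []*resourcegroupstaggingapi.TagFilter{{
				Key:    aws.String("openshiftClusterID"),
				Values: []*string{aws.String("609cbd6b-af81-4278-9285-30314d973020")},
			}},
		},
		func(page *resourcegroupstaggingapi.GetResourcesOutput, lastPage bool) bool {
			for _, mapping := range page.ResourceTagMappingList {
				arns = append(arns, aws.StringValue(mapping.ResourceARN))
			}
			return !lastPage // keep fetching until the final page
		},
	)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, arn := range arns {
		fmt.Println(arn)
	}
}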

This will conflict with #1031 (and likely anything else touching AWS destruction), CC @dgoodwin.

For previous discussion of Route 53 quota issues see #957 and #981.

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 10, 2019
@dgoodwin
Contributor

Where does this leave #1031 ? Do I need to reimplement? Kind of terrible timing for that. :)

@dgoodwin
Contributor

CC @joelddiaz

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 14, 2019
@joelddiaz
Contributor

@wking one thing to look out for is the case where we delete an instance and the cluster starts creating new instances before we terminate the instance running the MCO. From a glance at the loop over tagClients, it wasn't clear to me that the code would double-check that there truly are no more instances to delete after it has gone through the initial results of the GetResourcesPages() call (FWIW, I'm not at all familiar with GetResourcesPages).

I remember getting strange hangs during uninstall because we were racing to delete instances faster than the cluster could re-create them (that's why deleteEC2Instance() was modified to also search for instances in the 'pending' state).
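
Something along these lines (a sketch, not the actual deleteEC2Instance() code; the tag filter and helper name are illustrative):

// Sketch: list cluster instances including ones still in 'pending', so freshly
// created replacements get terminated too.
package example

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

func clusterInstanceIDs(client ec2iface.EC2API, clusterTagKey string) ([]string, error) {
	var ids []string
	err := client.DescribeInstancesPages(
		&ec2.DescribeInstancesInput{
			Filters: []*ec2.Filter{
				{Name: aws.String("instance-state-name"), Values: aws.StringSlice([]string{"pending", "running"})},
				{Name: aws.String("tag-key"), Values: aws.StringSlice([]string{clusterTagKey})},
			},
		},
		func(page *ec2.DescribeInstancesOutput, lastPage bool) bool {
			for _, reservation := range page.Reservations {
				for _, instance := range reservation.Instances {
					ids = append(ids, aws.StringValue(instance.InstanceId))
				}
			}
			return !lastPage // walk every page of results
		},
	)
	return ids, err
}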

HTH

@wking
Member Author

wking commented Jan 15, 2019

From a glance at the loop over tagClients, it wasn't clear to me that the code would double-check...

That's what this FIXME is for ;).

@cgwalters
Member

one thing to look out for is the case where we delete an instance and the cluster starts creating new instances before we terminate the instance running the MCO.

You mean MAO i.e. machine-api-operator.

But yeah...shouldn't the destroy command try to go in and neuter the MAO in a cluster if it's running?

@wking
Member Author

wking commented Jan 18, 2019

But yeah...shouldn't the destroy command try to go in and neuter the MAO in a cluster if it's running?

It could. And it could also try to scale MachineSets to zero replicas, disable autoscalers, remove routes, etc. But we need deletion to be robust even if the cluster is borked, and once we have that, these sorts of gentle wind-down approaches seem unnecessary. Patches are welcome, though, if it bothers folks enough that they don't want to wait until we have time to add them. Until then, the destroy logic just has to kill the cluster faster than it can self-heal ;).

@wking wking force-pushed the tag-search-destroy branch from 1258fd5 to eb5d7e0 Compare January 19, 2019 07:19
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 19, 2019
@wking wking changed the title WIP: pkg/destroy/aws: Use the resource-groups service for tag->ARN lookup pkg/destroy/aws: Use the resource-groups service for tag->ARN lookup Jan 19, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 19, 2019
@wking
Member Author

wking commented Jan 19, 2019

Until then, the destroy logic just has to kill the cluster faster than it can self-heal ;).

Thinking this over some more, how quickly is the new credentials operator noticing and recovering from user deletion? If this destroy script goes through and deletes the operator-created users early in the process, the cluster-API providers will no longer be authorized to create new machines, the ingress operator will no longer be authorized to create load balancers, etc. We should be able to freeze self-management by removing those users. And if the credentials operator makes itself a new, per-cluster user for creating sub-users, then this destroy code looping through and deleting per-cluster IAM users would definitely freeze everything. Thoughts, @dgoodwin?

@wking wking force-pushed the tag-search-destroy branch 3 times, most recently from 8fc1af0 to 5a2809a Compare January 19, 2019 10:56
@wking
Member Author

wking commented Jan 19, 2019

Logs from the successful teardown, as seen by users at the default info log level:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1039/pull-ci-openshift-installer-master-e2e-aws/3043/artifacts/e2e-aws/container-logs/teardown.log.gz | gunzip | grep -A100 'Deprovisioning'
Deprovisioning cluster ...
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-08d54d07c6377bb60" id=rtbassoc-0f10b91d5820124c7
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-08d54d07c6377bb60" id=rtb-08d54d07c6377bb60
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-07606cb62583edfdf" id=rtbassoc-004589319449dabae
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-07606cb62583edfdf" id=rtb-07606cb62583edfdf
level=info msg=Disassociated IAM instance profile="arn:aws:iam::460538899914:instance-profile/ci-op-w01cqvnw-1d3f3-master-profile" arn="arn:aws:ec2:us-east-1:460538899914:instance/i-07f7b16c957a863ae" id=i-07f7b16c957a863ae name=ci-op-w01cqvnw-1d3f3-master-profile role=ci-op-w01cqvnw-1d3f3-master-role
level=info msg=Deleted IAM instance profile="arn:aws:iam::460538899914:instance-profile/ci-op-w01cqvnw-1d3f3-master-profile" arn="arn:aws:ec2:us-east-1:460538899914:instance/i-07f7b16c957a863ae" id=i-07f7b16c957a863ae name=ci-op-w01cqvnw-1d3f3-master-profile
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-07f7b16c957a863ae" id=i-07f7b16c957a863ae
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f15e4df9edf7e7b3" id=rtbassoc-03063b5a8fba5ad73
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f15e4df9edf7e7b3" id=rtb-0f15e4df9edf7e7b3
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f406bcd82b0b270a" id=rtbassoc-0706552e38d310167
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f406bcd82b0b270a" id=rtbassoc-0fafd82b0ae07e1d7
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f406bcd82b0b270a" id=rtbassoc-07759e5b53ac430b8
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f406bcd82b0b270a" id=rtbassoc-01c0edee4bfdbf646
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f406bcd82b0b270a" id=rtbassoc-04d2f35954cec0880
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0f406bcd82b0b270a" id=rtbassoc-0d6b9e341743cc5e7
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-05333e0adc4b369d5" id=i-05333e0adc4b369d5
level=info msg=Disassociated IAM instance profile="arn:aws:iam::460538899914:instance-profile/ci-op-w01cqvnw-1d3f3-worker-profile" arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0b8aef30b2b85a2c9" id=i-0b8aef30b2b85a2c9 name=ci-op-w01cqvnw-1d3f3-worker-profile role=ci-op-w01cqvnw-1d3f3-worker-role
level=info msg=Deleted IAM instance profile="arn:aws:iam::460538899914:instance-profile/ci-op-w01cqvnw-1d3f3-worker-profile" arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0b8aef30b2b85a2c9" id=i-0b8aef30b2b85a2c9 name=ci-op-w01cqvnw-1d3f3-worker-profile
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0b8aef30b2b85a2c9" id=i-0b8aef30b2b85a2c9
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0ef7376ee975d7c12" id=rtbassoc-01b934575f14b766b
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:route-table/rtb-0ef7376ee975d7c12" id=rtb-0ef7376ee975d7c12
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-05487001a07a5b5be" id=i-05487001a07a5b5be
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 public zone=/hostedzone/Z328TAU66YDJH7 record set="A ci-op-w01cqvnw-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 record set="A ci-op-w01cqvnw-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 record set="A ci-op-w01cqvnw-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 record set="A ci-op-w01cqvnw-1d3f3-etcd-1.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 record set="A ci-op-w01cqvnw-1d3f3-etcd-2.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 record set="SRV _etcd-server-ssl._tcp.ci-op-w01cqvnw-1d3f3.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 public zone=/hostedzone/Z328TAU66YDJH7 record set="A \\052.apps.ci-op-w01cqvnw-1d3f3.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3 record set="A \\052.apps.ci-op-w01cqvnw-1d3f3.origin-ci-int-aws.dev.rhcloud.com."
level=info msg=Deleted arn="arn:aws:route53:::hostedzone/ZAXM8N9Z96NJ3" id=ZAXM8N9Z96NJ3
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0777455b0434571a8" id=i-0777455b0434571a8
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" classic load balancer=aebc687541bdc11e9a1e20e71c77f5cc id=vpc-08956078fafcbdec4
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4 load balancer=loadbalancer/net/ci-op-w01cqvnw-1d3f3-int/5b495b87ac6b6e52
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4 load balancer=loadbalancer/net/ci-op-w01cqvnw-1d3f3-ext/ce6607e077507783
level=info msg=Deleted VPC endpoint=vpce-045b9ea5642409aff arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4 table=rtb-05d72d70d343560ce
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=rtbassoc-05e90193769663473 table=rtb-07fd0d17bf60c08f7
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4 table=rtb-07fd0d17bf60c08f7
level=info msg=Disassociated arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=rtbassoc-010d76b6ab4efe1e6 table=rtb-0e73c4635365e8141
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4 table=rtb-0e73c4635365e8141
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-0c553c581c0c05939" id=subnet-0c553c581c0c05939
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-06416a72fbd020c37" id=subnet-06416a72fbd020c37
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:natgateway/nat-00a6de3c0f31ba2ea" id=nat-00a6de3c0f31ba2ea
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:natgateway/nat-0317ccd8c1bee5b7b" id=nat-0317ccd8c1bee5b7b
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:natgateway/nat-0362f03c7b460dfc8" id=nat-0362f03c7b460dfc8
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:natgateway/nat-089a90316703bf8cd" id=nat-089a90316703bf8cd
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:natgateway/nat-0c22fa1228a07ce20" id=nat-0c22fa1228a07ce20
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:natgateway/nat-0c442b564a8c666d2" id=nat-0c442b564a8c666d2
level=info msg=Deleted arn="arn:aws:elasticloadbalancing:us-east-1:460538899914:targetgroup/ci-op-w01cqvnw-1d3f3-api-ext/8b8380f21e21a8fe" id=ci-op-w01cqvnw-1d3f3-api-ext/8b8380f21e21a8fe
level=info msg=Deleted arn="arn:aws:elasticloadbalancing:us-east-1:460538899914:targetgroup/ci-op-w01cqvnw-1d3f3-api-int/2cbd81986eb4d9d6" id=ci-op-w01cqvnw-1d3f3-api-int/2cbd81986eb4d9d6
level=info msg=Deleted arn="arn:aws:elasticloadbalancing:us-east-1:460538899914:targetgroup/ci-op-w01cqvnw-1d3f3-services/5936e74f37b60521" id=ci-op-w01cqvnw-1d3f3-services/5936e74f37b60521
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0df56db9b6ea4838d" id=i-0df56db9b6ea4838d
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-0e6baced9303860a6" id=subnet-0e6baced9303860a6
level=info msg=Deleted arn="arn:aws:s3:::image-registry-us-east-1-55206167869847a997638fcd23075f5f-ecc7"
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-066bb64da56410b9d" id=sg-066bb64da56410b9d
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-0ec4ea796679549cc" id=sg-0ec4ea796679549cc
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-036b0d0a136290d41" id=sg-036b0d0a136290d41
level=info msg=Deleted arn="arn:aws:iam::460538899914:role/ci-op-w01cqvnw-1d3f3-master-role" id=ci-op-w01cqvnw-1d3f3-master-role name=ci-op-w01cqvnw-1d3f3-master-role policy=ci-op-w01cqvnw-1d3f3_master_policy
level=info msg=Deleted arn="arn:aws:iam::460538899914:role/ci-op-w01cqvnw-1d3f3-master-role" id=ci-op-w01cqvnw-1d3f3-master-role name=ci-op-w01cqvnw-1d3f3-master-role
level=info msg=Deleted arn="arn:aws:iam::460538899914:role/ci-op-w01cqvnw-1d3f3-worker-role" id=ci-op-w01cqvnw-1d3f3-worker-role name=ci-op-w01cqvnw-1d3f3-worker-role policy=ci-op-w01cqvnw-1d3f3_worker_policy
level=info msg=Deleted arn="arn:aws:iam::460538899914:role/ci-op-w01cqvnw-1d3f3-worker-role" id=ci-op-w01cqvnw-1d3f3-worker-role name=ci-op-w01cqvnw-1d3f3-worker-role
level=info msg=Deleted arn="arn:aws:iam::460538899914:user/cloud-credential-operator-iam-ro-fwkzb" id=cloud-credential-operator-iam-ro-fwkzb policy=cloud-credential-operator-iam-ro-fwkzb-policy
level=info msg=Deleted arn="arn:aws:iam::460538899914:user/cloud-credential-operator-iam-ro-fwkzb" id=cloud-credential-operator-iam-ro-fwkzb
level=info msg=Deleted arn="arn:aws:iam::460538899914:user/openshift-image-registry-q6p2t" id=openshift-image-registry-q6p2t policy=openshift-image-registry-q6p2t-policy
level=info msg=Deleted arn="arn:aws:iam::460538899914:user/openshift-image-registry-q6p2t" id=openshift-image-registry-q6p2t
level=info msg=Deleted arn="arn:aws:iam::460538899914:user/openshift-ingress-mc9xp" id=openshift-ingress-mc9xp policy=openshift-ingress-mc9xp-policy
level=info msg=Deleted arn="arn:aws:iam::460538899914:user/openshift-ingress-mc9xp" id=openshift-ingress-mc9xp
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:volume/vol-0a62e6cc8a74478dd" id=vol-0a62e6cc8a74478dd
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:volume/vol-059083b04f4e4c2c7" id=vol-059083b04f4e4c2c7
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:volume/vol-05b1b3e362292a67b" id=vol-05b1b3e362292a67b
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:volume/vol-0d859f7e6e4690e39" id=vol-0d859f7e6e4690e39
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-0062a468064709bfd" id=subnet-0062a468064709bfd
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-0debe524c9af7eee1" id=subnet-0debe524c9af7eee1
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:volume/vol-028bb2f0fae8682dd" id=vol-028bb2f0fae8682dd
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:volume/vol-065c16f902036b5ab" id=vol-065c16f902036b5ab
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-05121239b7ccf1425" id=sg-05121239b7ccf1425
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-0d90b97dc805e942d" id=subnet-0d90b97dc805e942d
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-09743a51024b9ecbf" id=sg-09743a51024b9ecbf
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-07f0d2123e49069b4" id=subnet-07f0d2123e49069b4
level=info msg=Released arn="arn:aws:ec2:us-east-1:460538899914:elastic-ip/eipalloc-002b0ca2bbaca69c9" id=eipalloc-002b0ca2bbaca69c9
level=info msg=Released arn="arn:aws:ec2:us-east-1:460538899914:elastic-ip/eipalloc-0549c2421d18e7d69" id=eipalloc-0549c2421d18e7d69
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-0bcff76420688dc8f" id=subnet-0bcff76420688dc8f
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-0c26b8f2a4bc4d8de" id=subnet-0c26b8f2a4bc4d8de
level=info msg=Released arn="arn:aws:ec2:us-east-1:460538899914:elastic-ip/eipalloc-006407445dbc6c7ea" id=eipalloc-006407445dbc6c7ea
level=info msg=Released arn="arn:aws:ec2:us-east-1:460538899914:elastic-ip/eipalloc-06c27f76c5cd83682" id=eipalloc-06c27f76c5cd83682
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-04dfd603fb361c629" id=subnet-04dfd603fb361c629
level=info msg=Released arn="arn:aws:ec2:us-east-1:460538899914:elastic-ip/eipalloc-043b512ecdd28069f" id=eipalloc-043b512ecdd28069f
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-03c7e4fff64d4bc97" id=subnet-03c7e4fff64d4bc97
level=info msg=Released arn="arn:aws:ec2:us-east-1:460538899914:elastic-ip/eipalloc-0e4b9bee28d53ebbc" id=eipalloc-0e4b9bee28d53ebbc
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:internet-gateway/igw-0ca0fe079f943bce7" id=igw-0ca0fe079f943bce7
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-062b01bed3f5bee74" id=subnet-062b01bed3f5bee74
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-0e70a157197eac76f" id=sg-0e70a157197eac76f
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4

From the timestamped .openshift_install.log, destruction started at 11:58:46, was down to the VPC, a security group, and some subnets by 12:00:00, and then wrapped up with:

time="2019-01-19T12:00:14Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:subnet/subnet-062b01bed3f5bee74" id=subnet-062b01bed3f5bee74
time="2019-01-19T12:00:14Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/ci-op-w01cqvnw-1d3f3\":\"owned\"}"
time="2019-01-19T12:00:14Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-0e70a157197eac76f" id=sg-0e70a157197eac76f
time="2019-01-19T12:00:15Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-08956078fafcbdec4" id=vpc-08956078fafcbdec4

I expect the delay was while AWS slowly tore down terminated instances or NAT gateways, or one of its other slow-to-delete resource types. Overall, that comes to an 89-second teardown. For comparison, this run (which uses the existing teardown logic from master) ran from 12:11:53 through 12:14:13, taking 140 seconds. Both of those runs are in a quiet period for the CI account and just after I'd gone through and cleaned up most leaked resources. A random run from a busy, crufty period ran from 06:13:47 through 06:16:55, taking 188 seconds. So I suspect this is going to be faster and more robust than the current master, but we're still in the same ballpark.

}
}

return lastPage
Member Author

This line (and a number of similar lines throughout this package) was buggy. I'd added the first instance of this bug in openshift/hive@d4b97563, but the docs say "To stop iterating, return false from the fn function." So this pull request fixes the buggy, fetch-only-the-first-page handling for these resource types:

$ git cat-file -p b8538961a8de^:pkg/destroy/aws/aws.go | grep 'Pages(\|lastPage'
	if err := elbv2Client.DescribeLoadBalancersPages(&elbv2.DescribeLoadBalancersInput{}, func(results *elbv2.DescribeLoadBalancersOutput, lastPage bool) bool {
		return lastPage
	if err := elbv2Client.DescribeTargetGroupsPages(&elbv2.DescribeTargetGroupsInput{}, func(results *elbv2.DescribeTargetGroupsOutput, lastPage bool) bool {
		return lastPage
	err := ec2Client.DescribeInstancesPages(&describeInstancesInput, func(results *ec2.DescribeInstancesOutput, lastPage bool) bool {
		return lastPage
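
For comparison, roughly the shape of the fix (a sketch against an assumed elbv2 client; the real code also acts on the results rather than just collecting them):

// Sketch: collect every load balancer by paging until the SDK reports the last
// page.  lastPage is false on all but the final page, so returning it stopped
// the iteration after page one whenever more pages remained.
package example

import (
	"github.com/aws/aws-sdk-go/service/elbv2"
	"github.com/aws/aws-sdk-go/service/elbv2/elbv2iface"
)

func listAllLoadBalancers(client elbv2iface.ELBV2API) ([]*elbv2.LoadBalancer, error) {
	var loadBalancers []*elbv2.LoadBalancer
	err := client.DescribeLoadBalancersPages(
		&elbv2.DescribeLoadBalancersInput{},
		func(results *elbv2.DescribeLoadBalancersOutput, lastPage bool) bool {
			loadBalancers = append(loadBalancers, results.LoadBalancers...)
			return !lastPage // the buggy version returned lastPage here
		},
	)
	return loadBalancers, err
}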

Member Author

This pull request adds pagination for a number of resources that were missing pagination before (and even when we called *Pages() before, we were doing it wrong):

$ git grep 'Pages(' pkg/destroy/aws/aws.go
pkg/destroy/aws/aws.go:				err = tagClient.GetResourcesPages(
pkg/destroy/aws/aws.go:	err2 := search.client.ListRolesPages(
pkg/destroy/aws/aws.go:	err2 := search.client.ListUsersPages(
pkg/destroy/aws/aws.go:	err2 := client.DescribeRouteTablesPages(
pkg/destroy/aws/aws.go:	err2 := client.DescribeLoadBalancersPages(
pkg/destroy/aws/aws.go:	err2 := client.DescribeTargetGroupsPages(
pkg/destroy/aws/aws.go:	err2 := client.DescribeLoadBalancersPages(
pkg/destroy/aws/aws.go:	err2 := client.ListRolePoliciesPages(
pkg/destroy/aws/aws.go:	err2 = client.ListInstanceProfilesForRolePages(
pkg/destroy/aws/aws.go:	err2 := client.ListUserPoliciesPages(
pkg/destroy/aws/aws.go:	err2 = client.ListAccessKeysPages(
pkg/destroy/aws/aws.go:	err = client.ListResourceRecordSetsPages(
pkg/destroy/aws/aws.go:	err2 := client.ListResourceRecordSetsPages(

It's unlikely that a single role will have pages of policies, but there could certainly be pages of private zones. This non-paginated line would have led to deeper zones being ignored, which may be the reason we'd sometimes leak private zones. That in turn would have led to leaking records in the public zone. And both of those leaks would require future destroy runs to grind through more Route 53 entries looking for things to delete, which was hurting our Route 53 API rate limits.
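
For illustration, a paginated private-zone lookup might look like this (a sketch; the matching predicate is an assumption, not the installer's exact logic):

// Sketch: walk every page of hosted zones looking for the cluster's private
// zone, instead of only the first page.
package example

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/route53"
	"github.com/aws/aws-sdk-go/service/route53/route53iface"
)

func findPrivateZone(client route53iface.Route53API, domain string) (*route53.HostedZone, error) {
	var match *route53.HostedZone
	err := client.ListHostedZonesPages(
		&route53.ListHostedZonesInput{},
		func(page *route53.ListHostedZonesOutput, lastPage bool) bool {
			for _, zone := range page.HostedZones {
				private := zone.Config != nil && aws.BoolValue(zone.Config.PrivateZone)
				if private && strings.TrimSuffix(aws.StringValue(zone.Name), ".") == domain {
					match = zone
					return false // found it; stop paging
				}
			}
			return !lastPage // otherwise keep walking the remaining pages
		},
	)
	return match, err
}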

Member Author

Oops, I'd meant to comment on this line for the unpaginated-zones issue.

@wking wking force-pushed the tag-search-destroy branch from 5a2809a to 3a06697 Compare January 19, 2019 12:59
@wking
Copy link
Member Author

wking commented Jan 21, 2019

e2e-aws hung:

2019/01/19 13:34:18 Container setup in pod e2e-aws completed successfully
2019/01/19 16:17:06 Copying artifacts from e2e-aws into /logs/artifacts/e2e-aws

/retest

@dgoodwin
Contributor

Thinking this over some more, how quickly is the new credentials operator noticing and recovering from user deletion? If this destroy script goes through and deletes the operator-created users early in the process, the cluster-API providers will no longer be authorized to create new machines, the ingress operator will no longer be authorized to create load balancers, etc. We should be able to freeze self-management by removing those users. And if the credentials operator makes itself a new, per-cluster user for creating sub-users, then this destroy code looping through and deleting per-cluster IAM users would definitely freeze everything. Thoughts, @dgoodwin?

The cred operator won't react immediately to a user deletion in AWS, so it would be picked up on the next Kube cycle, 10 minutes max, possibly going up to 30, last I heard. Could still get unlucky, but it would help. Unfortunately the credential for creating sub-users has to be given to us by the user, and is not a cred we could delete. If you were to go into etcd and wipe out the CredentialsRequests, that would help, but we've been hesitant to get into that and have hoped that deprovision would just be the same whether the cluster was live or dead.

@wking
Member Author

wking commented Jan 21, 2019

The cred operator won't react immediately to a user deletion in AWS, so it would be picked up on the next Kube cycle, 10 minutes max, possibly going up to 30, last I heard. Could still get unlucky, but it would help. Unfortunately the credential for creating sub-users has to be given to us by the user, and is not a cred we could delete.

That seems fine to me, especially since we're looping over both machines and credentials as this PR stands. It does mean that we don't want to rely on just a credential-deleting loop to block future machine creation.

wking added a commit to wking/openshift-installer that referenced this pull request Jan 25, 2019
Fixing a bug from e24c7dc (pkg/destroy/aws: Use the resource-groups
service for tag->ARN lookup, 2019-01-10, openshift#1039).  Before this commit,
we were setting loopError, so we'd still take another pass through the
loop.  But we weren't setting 'matched', because fetch errors
(e.g. because the caller lacked tag:GetResources) would not have
returned *any* resources.  The deletion code would wrongly assume that
there were no matching resources behind that tagClient and remove the
client from tagClients.  'Run' would end up exiting non-zero despite
having abandoned the resources behind that tagClient.

With this commit, we no longer prune the tagClient.  And since we
don't distinguish between fatal and non-fatal errors, we'll just loop
forever until the caller notices the problem and kills us.  That's not
great, but with permission pre-checks in the pipe via install-time
credential operator calls, I don't know if it's worth putting in a
fatal/nonfatal distinction now.
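
A sketch of the corrected pruning behavior described above (the helper shape and names are assumptions for illustration; the point is keep-on-error, prune-only-when-empty):

// Sketch: only drop a tag client when a *successful* GetResources call comes
// back with no matching resources; on fetch errors keep the client so a later
// pass can retry it.  Deletion of the matched ARNs is elided.
package example

import (
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi"
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi/resourcegroupstaggingapiiface"
)

func pruneTagClients(
	tagClients []resourcegroupstaggingapiiface.ResourceGroupsTaggingAPIAPI,
	filters []*resourcegroupstaggingapi.TagFilter,
) ([]resourcegroupstaggingapiiface.ResourceGroupsTaggingAPIAPI, error) {
	var lastError error
	nextClients := make([]resourcegroupstaggingapiiface.ResourceGroupsTaggingAPIAPI, 0, len(tagClients))
	for _, tagClient := range tagClients {
		matched := false
		err := tagClient.GetResourcesPages(
			&resourcegroupstaggingapi.GetResourcesInput{TagFilters: filters},
			func(results *resourcegroupstaggingapi.GetResourcesOutput, lastPage bool) bool {
				if len(results.ResourceTagMappingList) > 0 {
					matched = true // something is still tagged; keep this client around
				}
				return !lastPage
			},
		)
		if err != nil || matched {
			// On error we learned nothing, so we must not prune the client.
			nextClients = append(nextClients, tagClient)
			if err != nil {
				lastError = err
			}
		}
	}
	return nextClients, lastError
}
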
wking added a commit to wking/openshift-installer that referenced this pull request Mar 18, 2019
Fixing some bugs from e24c7dc (pkg/destroy/aws: Use the
resource-groups service for tag->ARN lookup, 2019-01-10, openshift#1039), where
I'd meant to update the delete-function-level lastError, but had
instead been creating a new local variable inside the *Pages helper.
wking added a commit to wking/openshift-installer that referenced this pull request Mar 18, 2019
We've caught this on DeleteSecurityGroup since e24c7dc
(pkg/destroy/aws: Use the resource-groups service for tag->ARN lookup,
2019-01-10, openshift#1039), but until now we didn't handle the more-likely
case where the security group in question had been removed before the
DescribeSecurityGroups call.
wking added a commit that referenced this pull request Oct 11, 2019
Remove any 'kubernetes.io/cluster/{id}: shared' tags, such as those
that we'll use for bring-your-own subnet workflows.  The 20-tag limit
that leads to the loop is from [1].  We're unlikely to exceed 20 with
just the subnets, but it seemed reasonable to plan for the future to
avoid surprises if we grow this list going forward.

Including tagClients allows us to find and remove tags from Route 53
resources, which live in us-east-1 regardless of the rest of the
cluster's region.  More on that under "The us-east-1 business..." in
e24c7dc (pkg/destroy/aws: Use the resource-groups service for
tag->ARN lookup, 2019-01-10, #1039).  I have not added untag support
for IAM resources (which are not supported by
resourcegroupstaggingapi, more on that in e24c7dc too), but we
could add untagging for them later if we want to support
explicitly-tagged shared IAM resources.

The append([]T(nil), a...) business is a slice copy [2], so we can
drain the stack in each of our loops while still having a full
tagClients slice to feed into the next loop.

I'm calculating the shared tag by looking for existing
'kubernetes.io/cluster/{id}: owned' tags.  In most cases there will be
only one, but the metadata structure theoretically allows folks to
pass in multiple such tags with different IDs, or similar tags with
values other than 'owned'.  The logic I have here will remove 'shared'
forms of any 'owned' tags, and will warn for non-owned tags just in
case the user expects us to be removing 'shared' in those cases where
we will not.  But I'm continuing on without returning an error,
because we want 'destroy cluster' to be pretty robust.

I'm also logging (with an info-level log) errors with retrieving tags
or untagging, but again continuing on without erroring out in
those cases.  Hopefully those errors are ephemeral and a future
attempt will punch through and succeed.  In future work, if these
errors are common enough, we might consider distinguishing between
error codes like the always-fatal ErrCodeInvalidParameterException and
the ephemeral ErrCodeThrottledException and aborting on always-fatal
errors.

The 'removed' tracker guards against redundant untag attempts in the
face of AWS eventual consistency (i.e. subsequent GetResources calls
returning tags which had been removed by earlier UntagResources
calls).  But there is a risk that we miss removing a 'shared' tag that
a parallel process re-injects while we're running:

1. We detect a shared tag on resource A.
2. We remove the tag from resource A.
3. Someone else adds it back.
4. We detect the restored tag on resource A, but interpret it as an
   eventual-consistency thing and leave it alone.

I inserted removeSharedTags at the end of the infrastructure removal
to reflect creation where we tag shared before creating other
infrastructure.  This effectively removes our reservations once we no
longer have any owned resources which could be consuming the shared
resources.  This might cause us to leak shared tags in cases where
'destroy cluster' is killed after infrastructure removal but before
shared-tag removal completed.  But there's not much you can do to
backstop that short of looking for orphaned 'shared' tags and removing
them via some custom reaper logic.

[1]: https://docs.aws.amazon.com/sdk-for-go/api/service/resourcegroupstaggingapi/#UntagResourcesInput
[2]: https://github.com/golang/go/wiki/SliceTricks#copy
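
A sketch of the batched untagging described above (simplified to a single tag key; the batch size of 20 matches the ResourceARNList limit in the UntagResourcesInput docs [1]):

// Sketch: remove one tag key from a list of ARNs, draining the list in
// batches because UntagResources accepts at most 20 ARNs per call.
package example

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi"
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi/resourcegroupstaggingapiiface"
)

func removeSharedTag(client resourcegroupstaggingapiiface.ResourceGroupsTaggingAPIAPI, arns []string, tagKey string) error {
	for len(arns) > 0 {
		batch := arns
		if len(batch) > 20 {
			batch = arns[:20]
		}
		arns = arns[len(batch):]
		_, err := client.UntagResources(&resourcegroupstaggingapi.UntagResourcesInput{
			ResourceARNList: aws.StringSlice(batch),
			TagKeys:         []*string{aws.String(tagKey)},
		})
		if err != nil {
			return err
		}
	}
	return nil
}
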
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019