🐛Cleanup ports #1063

Merged
merged 1 commit on Dec 1, 2021

Conversation

Contributor

@mdbooth mdbooth commented Nov 24, 2021

This PR is only the final commit in the series! All other commits are from #1061, which needs to merge first.

What this PR does / why we need it:
This fixes a port leak that occurs during instance creation when fetching the image or flavor fails for any reason, including user error.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1062

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

/hold until #1061 merges

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 24, 2021

netlify bot commented Nov 24, 2021

✔️ Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

🔨 Explore the source changes: 7a56ed0

🔍 Inspect the deploy log: https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/61a621627d4a330007e7f0e4

😎 Browse the preview: https://deploy-preview-1063--kubernetes-sigs-cluster-api-openstack.netlify.app

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 24, 2021
Member

@chrischdi chrischdi left a comment

The business looks way cleaner 👍

I'd say mission accomplished 😃

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 25, 2021
@chrischdi
Member

Looks like some prow / gcloud error unrelated to this PR at first glance?

@mdbooth
Contributor Author

mdbooth commented Nov 25, 2021

> Looks like some prow / gcloud error unrelated to this PR at first glance?

Kubelet didn't come up correctly on the control plane node:
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/1063/pull-cluster-api-provider-openstack-e2e-test/1463518333197357056/artifacts/clusters/bootstrap/instances/e2e-d43a5h/cluster-e2e-d43a5h-control-plane-p2h2j/cloud-final.log

I haven't yet worked out how to read kubelet's startup logs, so it's anybody's guess why. It's a flake, though.

@mdbooth
Contributor Author

mdbooth commented Nov 25, 2021

/test pull-cluster-api-provider-openstack-e2e-test

@chrischdi
Member

> > Looks like some prow / gcloud error unrelated to this PR at first glance?
>
> Kubelet didn't come up correctly on the control plane node: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/1063/pull-cluster-api-provider-openstack-e2e-test/1463518333197357056/artifacts/clusters/bootstrap/instances/e2e-d43a5h/cluster-e2e-d43a5h-control-plane-p2h2j/cloud-final.log
>
> I haven't yet worked out how to read kubelet's startup logs, so it's anybody's guess why. It's a flake, though.

All I can see is that the kubelet is not able to connect to the kube-apiserver to register itself as a node.

I also can't find any logs of the static pods :-(

@mdbooth
Contributor Author

mdbooth commented Nov 26, 2021

From CAPO controller logs:

I1125 15:22:55.469152       1 recorder.go:103] events "msg"="Warning"  "message"="Failed to delete security group k8s-cluster-e2e-158vcy-cluster-e2e-158vcy-secgroup-worker with id 94947127-7579-40e3-960d-98e1b600b423: Expected HTTP response code [] when accessing [DELETE http://10.0.3.15:9696/v2.0/security-groups/94947127-7579-40e3-960d-98e1b600b423], but got 409 instead\n{\"NeutronError\": {\"type\": \"SecurityGroupInUse\", \"message\": \"Security Group 94947127-7579-40e3-960d-98e1b600b423 in use.\", \"detail\": \"\"}}" "object"={"kind":"OpenStackCluster","namespace":"e2e-158vcy","name":"cluster-e2e-158vcy","uid":"19281045-6a3e-4d05-bb9f-72fe2868a2a9","apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","resourceVersion":"6621"} "reason"="FailedDeleteSecurityGroup"
E1125 15:22:55.470519       1 controller.go:317] controller/openstackcluster "msg"="Reconciler error" "error"="failed to delete security groups: Expected HTTP response code [] when accessing [DELETE http://10.0.3.15:9696/v2.0/security-groups/94947127-7579-40e3-960d-98e1b600b423], but got 409 instead\n{\"NeutronError\": {\"type\": \"SecurityGroupInUse\", \"message\": \"Security Group 94947127-7579-40e3-960d-98e1b600b423 in use.\", \"detail\": \"\"}}" "name"="cluster-e2e-158vcy" "namespace"="e2e-158vcy" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackCluster" 

I can't see any obvious reason for this in either the CAPO or the devstack logs. If I had to put my money somewhere, I would guess this is a neutron bug. Everything else seems to have been deleted AFAICT.

@mdbooth
Contributor Author

mdbooth commented Nov 26, 2021

/test pull-cluster-api-provider-openstack-e2e-test

@mdbooth
Contributor Author

mdbooth commented Nov 26, 2021

Same issue: failure to delete security group:

I1126 13:28:32.913722       1 recorder.go:103] events "msg"="Warning"  "message"="Failed to delete security group k8s-cluster-e2e-3elhda-cluster-e2e-3elhda-secgroup-worker with id adde714f-5834-4e02-83e5-92ba02e555d6: Expected HTTP response code [] when accessing [DELETE http://10.0.3.15:9696/v2.0/security-groups/adde714f-5834-4e02-83e5-92ba02e555d6], but got 409 instead\n{\"NeutronError\": {\"type\": \"SecurityGroupInUse\", \"message\": \"Security Group adde714f-5834-4e02-83e5-92ba02e555d6 in use.\", \"detail\": \"\"}}" "object"={"kind":"OpenStackCluster","namespace":"e2e-3elhda","name":"cluster-e2e-3elhda","uid":"5c537fb3-a12e-47a4-bc95-92cc01478386","apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","resourceVersion":"6334"} "reason"="FailedDeleteSecurityGroup"

It's the same security group (the worker secgroup) and the same test in both cases, so this is starting to look deterministic.

Possibly related: I notice that several changes landed in neutron stable/victoria on 24th November, which is when this change seems to have started failing.

@mdbooth
Contributor Author

mdbooth commented Nov 26, 2021

It's a bug introduced by this patch! `client.CreateServer()` returns `(*ServerExt, error)`. The port cleanup function doesn't clean up if `server != nil`, but `CreateServer` never returns nil for the server, even if there was an error 🤦‍♂️

Consequently the failed create doesn't clean up its ports, and the security group is still in use.

I've probably made this mistake systematically in a few places. I'll update.
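To illustrate the mistake described above, here is a minimal, self-contained sketch. It is not the code from this PR: `createServer`, `buggyCreate`, `fixedCreate` and the simplified `ServerExt` struct are hypothetical stand-ins, and keying the cleanup on the returned error is just one possible way to fix it.

```go
package main

import (
	"errors"
	"fmt"
)

// ServerExt is a simplified stand-in for the real type (assumption).
type ServerExt struct{ ID string }

// createServer is a hypothetical helper that mimics the behaviour described
// above: it returns a non-nil pointer even when it also returns an error.
func createServer(fail bool) (*ServerExt, error) {
	server := &ServerExt{}
	if fail {
		return server, errors.New("create failed")
	}
	server.ID = "server-1"
	return server, nil
}

// buggyCreate only cleans up when server is nil. Because createServer never
// returns a nil pointer, the cleanup is skipped on failure and ports leak.
func buggyCreate(fail bool, deleted *[]string) error {
	ports := []string{"port-a", "port-b"}
	var server *ServerExt
	defer func() {
		if server == nil {
			*deleted = append(*deleted, ports...)
		}
	}()
	var err error
	server, err = createServer(fail)
	return err
}

// fixedCreate keys the cleanup on the named error result instead of on the
// pointer, so a failed create always releases the ports.
func fixedCreate(fail bool, deleted *[]string) (err error) {
	ports := []string{"port-a", "port-b"}
	defer func() {
		if err != nil {
			*deleted = append(*deleted, ports...)
		}
	}()
	_, err = createServer(fail)
	return err
}

func main() {
	var buggy, fixed []string
	_ = buggyCreate(true, &buggy)
	_ = fixedCreate(true, &fixed)
	fmt.Println("buggy cleaned up:", buggy) // prints "buggy cleaned up: []"
	fmt.Println("fixed cleaned up:", fixed) // prints "fixed cleaned up: [port-a port-b]"
}
```

Checking the error rather than the pointer avoids depending on what a callee returns alongside a non-nil error.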

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 26, 2021
@mdbooth
Contributor Author

mdbooth commented Nov 26, 2021

Fixed it 🎉

Still need to merge #1061 first, which is under review with @macaptain.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 30, 2021
Member

@chrischdi chrischdi left a comment

The other PR was merged.

One last idea: do we want to move the defer for deleting ports above the ports creation?

var server *ServerExt

// Ensure we delete the ports we created if we haven't created the server.
defer func() {
Member

Should the defer maybe also be above line 184?

If we create multiple ports and one of them fails, we may want to clean up the already created ones too.
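A rough sketch of that suggestion, assuming a hypothetical `portClient` with `createPort` and `deletePort` methods (none of these names come from the PR): the deferred cleanup is registered before the loop that creates ports, so a failure part-way through still removes the ports created so far.

```go
package main

import (
	"errors"
	"fmt"
)

// portClient is a hypothetical in-memory stand-in for a networking client.
type portClient struct {
	failAt  int             // number of successful creates before a failure; -1 for never
	live    map[string]bool // ports that currently exist
	created int
}

func (c *portClient) createPort(name string) (string, error) {
	if c.created == c.failAt {
		return "", errors.New("port create failed: " + name)
	}
	c.created++
	id := fmt.Sprintf("port-%d", c.created)
	c.live[id] = true
	return id, nil
}

func (c *portClient) deletePort(id string) { delete(c.live, id) }

// createPorts registers its cleanup before creating anything. If any create
// fails, the deferred function deletes every port created up to that point.
func createPorts(c *portClient, names []string) (ids []string, err error) {
	defer func() {
		if err != nil {
			for _, id := range ids {
				c.deletePort(id)
			}
			ids = nil
		}
	}()
	for _, name := range names {
		id, createErr := c.createPort(name)
		if createErr != nil {
			return ids, createErr
		}
		ids = append(ids, id)
	}
	return ids, nil
}

func main() {
	c := &portClient{failAt: 2, live: map[string]bool{}}
	_, err := createPorts(c, []string{"a", "b", "c"})
	fmt.Println("error:", err)                      // prints "error: port create failed: c"
	fmt.Println("ports left behind:", len(c.live)) // prints "ports left behind: 0"
}
```

Had the defer been placed after the loop, the early return on the third port would have skipped it and left the first two ports behind.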

Contributor Author

I can also add a unit test for that.

Contributor Author

I've squashed this and added a new unit test.

We were previously failing to handle several error return cases which
would result in leaking ports. The defer() method of cleanup is
idiomatic, will handle the existing missing cases, and also the volume
cleanup case which is about to be added.
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 30, 2021
@mdbooth
Contributor Author

mdbooth commented Nov 30, 2021

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 30, 2021
@hidekazuna
Contributor

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hidekazuna, mdbooth

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 1, 2021
@macaptain
Contributor

Nice!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 1, 2021
@k8s-ci-robot k8s-ci-robot merged commit 244d31b into kubernetes-sigs:main Dec 1, 2021
@mdbooth mdbooth mentioned this pull request Feb 7, 2022
@pierreprinetti pierreprinetti deleted the cleanup-ports branch February 25, 2022 13:00
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

CreateInstance leaks ports on certain error conditions
5 participants