Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Fix random instance port deletion #1753

Merged
merged 1 commit into from
Jan 12, 2024

Conversation

zioc
Copy link
Contributor

@zioc zioc commented Nov 22, 2023

What this PR does / why we need it:

Under certain circumstances, when they are non-responsive cells, nova may return a partial result instead of an error.

(this is controlled by list_records_by_skipping_down_cells parameter that is true by default - no error will be thrown)

When it faces this issue, capo controller will try to create a new server for the machine, resulting in deleting ports of existing one.

In order avoid this issue, use GET instead of LIST to retrieve server using its uuid. For that purpose we must ensure that uuid is stored in machine spec as soon as the server is created, even if server is in error state.

Fixes: #1612

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Yes

TODOs:

  • [ X] squashed commits
  • if necessary:
    • includes documentation
    • adds unit tests

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 22, 2023
Copy link

linux-foundation-easycla bot commented Nov 22, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: zioc / name: Francois Eleouet (75ffe73)

Copy link

netlify bot commented Nov 22, 2023

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Name Link
🔨 Latest commit 75ffe73
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/65a0f9bd3a9c9a0008d438ef
😎 Deploy Preview https://deploy-preview-1753--kubernetes-sigs-cluster-api-openstack.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Nov 22, 2023
@k8s-ci-robot
Copy link
Contributor

Welcome @zioc!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-openstack 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-openstack has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 22, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @zioc. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 22, 2023
@jichenjc
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 23, 2023
@jichenjc
Copy link
Contributor

@mdbooth maybe need your help on this to see fit for your original idea

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 23, 2023
@zioc
Copy link
Contributor Author

zioc commented Nov 23, 2023

I revisited to patch to remove the polling as suggested in the issue.

As a consequence, reconcileBastion has to be revisited to requeue until bastion server gets ready. If the approach looks good to you I'll work on adding some tests to controllers.

As instance status check is now moved to the main reconcile loop while the instance is in BUILDING state, its status will only be checked every minute, whereas retryInterval was 10s in CreateInstance polling. I was wondering if it would be acceptable to requeue sooner (in 10s like before for example) while the instance is still in BUILDING state?

@zioc zioc force-pushed the get-server-by-id branch 2 times, most recently from c940ccd to e185368 Compare November 27, 2023 12:20
Copy link
Contributor

@mdbooth mdbooth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great: I'm very much in favour of this!

The e2e test failure looks real, though. Something odd is going on with Bastion creation. It looks like we end up creating bastions continuously until devstack runs out of quota.

To take a look, have a look at the e2e job failure in the PR. At the top you'll see 'Artifacts' which link to everything we dump out of the test run. Some interesting logs:

There's one other factor to consider here from a design POV: we need to handle the case where we fail to write the machine/bastion ID to the status. It turns out it is surprisingly easy for this to happen due to the way we patch objects, but we should also consider that brief outages of the k8s control plane are expected from time to time.

My solution to this to date has been that if the ID is not stored we always check by name, and adopt any resource we find. This solution has its own edge cases, but I think it's a lesser evil.

One more thing, and I mention this only because it looks like you know what you're doing and appreciate clean controllers. Feel free to ignore this entirely if the scope creep is too much:

I would really like to move machine creation out of the cluster controller, but we haven't gotten around to it yet. Instead of calling directly into CreateInstance, I would like to see the cluster controller create a Machine/OpenStackMachine for the Bastion and watch it. That would eliminate a bunch of cruft and duplication, and enable more robustification of the machine cleanup flow.

Again, only if you're up for it. It would make this cleanup simpler, but I'm pretty sure the total work involved would be greater.

@@ -379,6 +383,9 @@ func (r *OpenStackMachineReconciler) reconcileNormal(ctx context.Context, scope
scope.Logger().Info("Machine instance state is DELETED, no actions")
conditions.MarkFalse(openStackMachine, infrav1.InstanceReadyCondition, infrav1.InstanceDeletedReason, clusterv1.ConditionSeverityError, "")
return ctrl.Result{}, nil
case infrav1.InstanceStateBuilding:
scope.Logger().Info("Waiting for instance to become ACTIVE", "id", instanceStatus.ID(), "status", instanceStatus.State())
return ctrl.Result{RequeueAfter: waitForBuildingInstanceToReconcile}, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I think it's sensible to have a different requeue after here: this is an expected state which under normal circumstances should resolve quickly. The fall-through below is 'random other'. We don't want to poll at 10 second intervals for a server that's shutoff.

@zioc
Copy link
Contributor Author

zioc commented Nov 29, 2023

Thanks a lot for the feedback and suggestions @mdbooth , as well as tips to understand e2e failures, they were very helpful.

It is indeed much safer to try to get instance by name when there is no ID saved in resource, I've added that fallback in various places.

While reworking the reconcileBastion method to fix the observed issue, I figured out that current implementation is not really reconcilling bastion floating IP as it exits as soon as instance hash has been saved in annotations.

With current implementation, if floating IP fails to be attached to the instance (it can be the case if we provide a cluster network without router), it will be creating floating IPs in loop (similarly to what has been described in this issue.

I tried to revisit that part to make it more robust and save allocated floatingIP in cluster status. The proposed patch is functional in most cases, but there is a race condition that sometimes leads to the creation of two floating IPs. It is very likely to be caused by "the way we patch objects" as you said: when bastion is created, its object is patched before the status subresource as we are adding the bastionHash annotation. Consequently cluster may reconcile without updated status and create a new floatingIP.
Unfortunately, as floatingIPs objects don't have names, we can't check by name as a fallback as we do for Instances (we could use description for that purpose, but it is not very clean...).
Other option would be to patch resources status as soon as we want to store some identifier (like instance IDs or floating IPs), this way we would have less chances to face this kind of race conditions.
Last option (and probably the most reasonable one) will be to avoid revisiting the floating IP attachment part (it probably deserves its own PR to keep this one simple), but I felt it was worth sharing this patch anyway for the shake of discussion.

Regarding your suggestion to use Machine/OpenstackMachine for bastion, it seems very interesting at first glance, but there are several points that remains unclear to me:

  • I'm not sure if it's very clean to have a machine that does not belong to any controlplane or machine deployment from a capi controller perspective, as it expects to find a corresponding node in kubernetes at some point.
  • As a large part of the code in reconcile/deleteBastion is handling floatingIP attachment and status update, we would still have to handle that part even if we were using an openstackMachine to control the openstack instance. I'm wondering if we would be sharing that much code at the end (but I may be missing something)

@zioc zioc force-pushed the get-server-by-id branch from d207046 to 398e443 Compare December 1, 2023 08:54
Copy link
Contributor Author

@zioc zioc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally I couldn't manage to avoid an important refactoring of reconcileBastion part. The race condition I was referring to in my previous message is fixed by checking if bastion already has a floating IP attached. (The race may still occur if floating IP failed to be attached, so we may create 2 floating IP in some cases, but it's already much better than current code that will create as many floating IP as it can if it fails to attach them for some reason).

@@ -116,7 +116,6 @@ func (is *InstanceStatus) BastionStatus(openStackCluster *infrav1.OpenStackClust

clusterNetwork := openStackCluster.Status.Network.Name
i.IP = ns.IP(clusterNetwork)
i.FloatingIP = ns.FloatingIP(clusterNetwork)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we were persisting the floating IP in status only if it was attached to the bastion VM. In order to properly track the created floating IP, it is now saved independently in reconcileBastion

// InstanceStateBuilding is the string representing an instance in a building state.
InstanceStateBuilding = InstanceState("BUILDING")
// InstanceStateBuild is the string representing an instance in a build state.
InstanceStateBuild = InstanceState("BUILD")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it is being created Nova server status is BUILD and not BUILDING. As this variable was not used elsewhere I felt it was ok to change it, please let me know it you foresee any issue to do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure whether there are some consistent
https://github.com/openstack/nova/blob/master/nova/objects/fields.py#L969
seems indicate it's BUILDING

but in the CLI it's BUILD
which makes me a little bit confused, maybe something updated in the nova client?

| ddd0059f-f71e-4b07-9b28-f7344496a713 | amphora-ed18d7f4-e0b8-4f30-98fa-889f4aa0771a | BUILD  | spawning   | NOSTATE     |                                                                |

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VM state seems to be derived from status, and the correspondance is not staightforward, for BUILDING state, we have a BUILD status

// InstanceStateBuilding is the string representing an instance in a building state.
InstanceStateBuilding = InstanceState("BUILDING")
// InstanceStateBuild is the string representing an instance in a build state.
InstanceStateBuild = InstanceState("BUILD")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure whether there are some consistent
https://github.com/openstack/nova/blob/master/nova/objects/fields.py#L969
seems indicate it's BUILDING

but in the CLI it's BUILD
which makes me a little bit confused, maybe something updated in the nova client?

| ddd0059f-f71e-4b07-9b28-f7344496a713 | amphora-ed18d7f4-e0b8-4f30-98fa-889f4aa0771a | BUILD  | spawning   | NOSTATE     |                                                                |

if err != nil {
return err
if openStackCluster.Status.Bastion != nil && openStackCluster.Status.Bastion.FloatingIP != "" {
// Floating IP could have been created but not associated to instance, attempt to delete it from saved status first
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add some log here as it's useful info

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is the regular case (floating IP is stored in status, and deleted here), I haven't removed the other block that removes floating IP attached to bastion, but it will we be useful only if floating IP is was not properly saved in status for some reason. Maybe I should reword the comment to make it clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the removal of floating IP will already be logged when it will be removed at this point, it does not seem very meaningfull to add another log. I changed my comment, hoping that it will be more explicit. Let me know if you want me to revisit that.

@zioc zioc force-pushed the get-server-by-id branch from 398e443 to 19a6cb5 Compare December 1, 2023 11:23
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth, zioc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 8, 2024
@zioc zioc force-pushed the get-server-by-id branch from 76f8a5d to b509efa Compare January 8, 2024 18:05
@EmilienM
Copy link
Contributor

EmilienM commented Jan 8, 2024

/test pull-cluster-api-provider-openstack-e2e-full-test

@zioc zioc force-pushed the get-server-by-id branch from b509efa to 6c4f075 Compare January 10, 2024 07:50
@zioc
Copy link
Contributor Author

zioc commented Jan 10, 2024

/approve

This looks great, and especially thanks for writing good tests with good descriptions.

I've approved, but I'd appreciated it if you could clear up the portID thing in the tests. It works, but it raised a flag for me that detracted from the clarity of the tests.

Thanks for the review @mdbooth I've updated the tests following your suggestion, and rebased on top of latest changes as there were merge conflicts.

@jichenjc could you please have a look too? It'd be really great to get rid of this issue!

@EmilienM
Copy link
Contributor

/test pull-cluster-api-provider-openstack-e2e-full-test

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 11, 2024
@jichenjc
Copy link
Contributor

@zioc can you help address above comments and solve teh conflict ? Thanks

Under certain circumstances, when they are non-responsive cells, nova
may return a partial result instead of an error.

(this is controlled by list_records_by_skipping_down_cells parameter
that is true by default - no error will be thrown)

When it faces this issue, capo controller will try to create a new
server for the machine, resulting in deleting ports of existing one.

In order avoid this issue, use GET instead of LIST with server uuid stored
in machine spec when it is created, and store server ID in machine spec
immediately.

For that purpose server status polling that used to take place in
CreateMachine has to be moved to main reconcile loop. ReconcileBastion
also needed to be revisited to properly support requeuing.
@zioc zioc force-pushed the get-server-by-id branch from 6c4f075 to 75ffe73 Compare January 12, 2024 08:35
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 12, 2024
@zioc
Copy link
Contributor Author

zioc commented Jan 12, 2024

@zioc can you help address above comments and solve teh conflict ? Thanks

@jichenjc I've rebased on top of latest change.

The comments have already been addressed, and the proposed PR is in line with the outcome of various discussions to my opinion. However, I didn't feel comfortable clicking on "Resolve conversation" as I wasn't the one who initiated them, should I?

@mdbooth
Copy link
Contributor

mdbooth commented Jan 12, 2024

FWIW, all of my comments are addressed 👍

@jichenjc
Copy link
Contributor

/lgtm
/hold cancel

looks good to me ,thanks~

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jan 12, 2024
@zioc
Copy link
Contributor Author

zioc commented Jan 12, 2024

thanks!

/retest

@lentzi90
Copy link
Contributor

/test pull-cluster-api-provider-openstack-e2e-test

@k8s-ci-robot k8s-ci-robot merged commit 9510304 into kubernetes-sigs:main Jan 12, 2024
9 checks passed
@mnaser
Copy link
Contributor

mnaser commented Jan 15, 2024

/cherry-pick release-0.9

@mnaser
Copy link
Contributor

mnaser commented Jan 15, 2024

/cherry-pick release-0.8

@k8s-infra-cherrypick-robot

@mnaser: #1753 failed to apply on top of branch "release-0.9":

Applying: Use get instead of list to retrieve servers
Using index info to reconstruct a base tree...
A	api/v1alpha8/types.go
M	controllers/openstackcluster_controller.go
M	controllers/openstackcluster_controller_test.go
M	controllers/openstackmachine_controller.go
M	controllers/suite_test.go
M	pkg/cloud/services/compute/instance.go
M	pkg/cloud/services/compute/instance_test.go
M	pkg/cloud/services/compute/instance_types.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/cloud/services/compute/instance_types.go
Auto-merging pkg/cloud/services/compute/instance_test.go
Auto-merging pkg/cloud/services/compute/instance.go
Auto-merging controllers/suite_test.go
CONFLICT (content): Merge conflict in controllers/suite_test.go
Auto-merging controllers/openstackmachine_controller.go
Auto-merging controllers/openstackcluster_controller_test.go
CONFLICT (content): Merge conflict in controllers/openstackcluster_controller_test.go
Auto-merging controllers/openstackcluster_controller.go
CONFLICT (modify/delete): api/v1alpha8/types.go deleted in HEAD and modified in Use get instead of list to retrieve servers. Version Use get instead of list to retrieve servers of api/v1alpha8/types.go left in tree.
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Use get instead of list to retrieve servers
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-0.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-infra-cherrypick-robot

@mnaser: #1753 failed to apply on top of branch "release-0.8":

Applying: Use get instead of list to retrieve servers
Using index info to reconstruct a base tree...
M	api/v1alpha7/types.go
A	api/v1alpha8/types.go
M	controllers/openstackcluster_controller.go
M	controllers/openstackcluster_controller_test.go
M	controllers/openstackmachine_controller.go
M	controllers/suite_test.go
M	pkg/cloud/services/compute/instance.go
M	pkg/cloud/services/compute/instance_test.go
M	pkg/cloud/services/compute/instance_types.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/cloud/services/compute/instance_types.go
Auto-merging pkg/cloud/services/compute/instance_test.go
CONFLICT (content): Merge conflict in pkg/cloud/services/compute/instance_test.go
Auto-merging pkg/cloud/services/compute/instance.go
Auto-merging controllers/suite_test.go
CONFLICT (content): Merge conflict in controllers/suite_test.go
Auto-merging controllers/openstackmachine_controller.go
CONFLICT (content): Merge conflict in controllers/openstackmachine_controller.go
Auto-merging controllers/openstackcluster_controller_test.go
CONFLICT (content): Merge conflict in controllers/openstackcluster_controller_test.go
Auto-merging controllers/openstackcluster_controller.go
CONFLICT (content): Merge conflict in controllers/openstackcluster_controller.go
CONFLICT (modify/delete): api/v1alpha8/types.go deleted in HEAD and modified in Use get instead of list to retrieve servers. Version Use get instead of list to retrieve servers of api/v1alpha8/types.go left in tree.
Auto-merging api/v1alpha7/types.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Use get instead of list to retrieve servers
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-0.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@EmilienM
Copy link
Contributor

EmilienM commented Jun 4, 2024

In this PR, FailedCreateServer error event was removed while it's still being tested:

return isErrorEventExists(namespace.Name, mdInvalidAZName, "FailedCreateServer", azError, eventList)

I have no idea how our CI didn't catch that 🤷 but for sure I'll take a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Randomly deleted port for VMs
8 participants