
🌱 [capd] Ensure Loadbalancer IP is not empty #4398

Merged

Conversation

Contributor

@ashish-amarnath ashish-amarnath commented Mar 29, 2021

Signed-off-by: Ashish Amarnath [email protected]

What this PR does / why we need it:

DockerCluster reconciliation tries to look up the IP of the load balancer container. In the process, it looks at all containers in the system by running docker ps -a, filtered by the name of the cluster. This also picks up containers that have been stopped, and stopped containers have no IP addresses associated with them. This results in an error in the DockerCluster controller like:

[manager] E0329 14:51:58.051663      10 controller.go:302] controller-runtime/manager/controller/dockercluster "msg"="Reconciler error" "error"="DockerCluster.infrastructure.cluster.x-k8s.io \"my-cluster\" is invalid: spec.controlPlaneEndpoint.host: Required value" "name"="my-cluster" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerCluster" 

This change adds a filter to the docker ps -a command so that only running containers are picked up, and also returns an error if the LB has no IP addresses associated with it.
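A minimal sketch of the two pieces described above. This is not the actual CAPD code: the helper names and the shell-out to docker are illustrative, and the running-only filter was later dropped from this PR (see the discussion below).

// A minimal sketch (not the actual CAPD code): list only running containers
// for a cluster and fail when the load balancer has no address at all.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// listRunningClusterContainers shells out to `docker ps`, filtering by the
// cluster name and by status=running so stopped containers are skipped.
// (The running-only filter was later dropped from this PR.)
func listRunningClusterContainers(clusterName string) ([]string, error) {
	out, err := exec.Command(
		"docker", "ps",
		"--filter", "name="+clusterName,
		"--filter", "status=running",
		"--format", "{{.Names}}",
	).Output()
	if err != nil {
		return nil, err
	}
	return strings.Fields(string(out)), nil
}

// requireLBAddress mirrors the new error check: a load balancer with neither
// an IPv4 nor an IPv6 address is reported as an error instead of silently
// producing an empty spec.controlPlaneEndpoint.host.
func requireLBAddress(ipv4, ipv6 string) error {
	if ipv4 == "" && ipv6 == "" {
		return fmt.Errorf("load balancer has no IP addresses associated with it")
	}
	return nil
}

func main() {
	names, err := listRunningClusterContainers("my-cluster")
	fmt.Println(names, err)
	fmt.Println(requireLBAddress("", "")) // error: no addresses
}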

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4396

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 29, 2021
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 29, 2021
@ashish-amarnath ashish-amarnath changed the title 🐛 [WIP] [capd] Filter running LB containers for DockerCluster 🐛 [capd] Filter running LB containers for DockerCluster Mar 29, 2021
@ashish-amarnath
Contributor Author

Not sure how to kick the Netlify checks.

@@ -83,6 +83,9 @@ func (n *Node) IP(ctx context.Context) (ipv4 string, ipv6 string, err error) {
	if len(ips) != 2 {
		return "", "", errors.Errorf("container addresses should have 2 values, got %d values", len(ips))
	}
	if ips[0] == "" && ips[1] == "" {
Contributor
This change seems unrelated to the LB container change to me. Am I missing something obvious?

Member

@sbueringer sbueringer Mar 30, 2021

In case a change here was intended: do we want to return an error if

  • both are empty, or
  • one of them is empty?

Contributor Author

@ashish-amarnath ashish-amarnath Mar 30, 2021

@MarcelMue and @elmiko This change is related to the LB container.
During DockerCluster reconciliation:

  1. In the reconcileNormal method we try to look up the IP of the previously provisioned LB by calling LoadBalancer.IP.
  2. LoadBalancer.IP in turn calls the Node.IP function via s.container.IP() (see the sketch below).
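A simplified sketch of that call chain. This is not the actual CAPD source: only the names reconcileNormal, LoadBalancer.IP, and Node.IP come from the thread; everything else, including placing the empty-address check in LoadBalancer.IP as suggested later in this thread, is illustrative.

// Simplified sketch of the call chain described above. Only the names
// reconcileNormal, LoadBalancer.IP and Node.IP come from the thread;
// everything else is illustrative.
package main

import (
	"context"
	"errors"
	"fmt"
)

// Node stands in for the load balancer container.
type Node struct{ ipv4, ipv6 string }

// IP returns the container's IPv4 and IPv6 addresses (ips[0] and ips[1]
// in the diff above).
func (n *Node) IP(ctx context.Context) (string, string, error) {
	return n.ipv4, n.ipv6, nil
}

// LoadBalancer wraps the container, mirroring s.container.IP() in step 2.
type LoadBalancer struct{ container *Node }

// IP delegates to the underlying container and rejects a load balancer
// that has neither address, which is the check discussed in this thread.
func (lb *LoadBalancer) IP(ctx context.Context) (string, error) {
	ipv4, ipv6, err := lb.container.IP(ctx)
	if err != nil {
		return "", err
	}
	if ipv4 == "" && ipv6 == "" {
		return "", errors.New("load balancer container has no IP address")
	}
	if ipv4 != "" {
		return ipv4, nil
	}
	return ipv6, nil
}

// reconcileNormal is the entry point from step 1: it looks up the LB address
// that would be written into spec.controlPlaneEndpoint.host.
func reconcileNormal(ctx context.Context, lb *LoadBalancer) (string, error) {
	return lb.IP(ctx)
}

func main() {
	// A stopped container shows up with empty addresses, reproducing the bug.
	host, err := reconcileNormal(context.Background(), &LoadBalancer{container: &Node{}})
	fmt.Println(host, err)
}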

Contributor Author

@sbueringer the IPs returned here are the IPv4 (ips[0]) and IPv6 (ips[1]) addresses. Until we decide that only one of those is what we will support, returning an error if both are empty seems reasonable. WDYT?

Contributor Author

On thinking about this further, returning this error from Node.IP() may not be the correct thing. IMO, this check should be performed in the LoadBalancer.IP function.

Member

@ashish-amarnath Sounds okay to me. I don't have the necessary context to know if we expect both of them to be set or just one of them.

@ashish-amarnath
Contributor Author

/assign @fabriziopandini

@ashish-amarnath ashish-amarnath force-pushed the lb-status-filter branch 2 times, most recently from 224fd1d to 29ee27c on March 30, 2021 16:20
Contributor

@MarcelMue MarcelMue left a comment

/lgtm

test/infrastructure/docker/docker/loadbalancer.go (outdated review comment, resolved)
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 30, 2021
@fabriziopandini
Member

fabriziopandini commented Mar 30, 2021

@ashish-amarnath changes lgtm to me but I have two concerns:

  • What is the use case that led to having a stopped load balancer container? This should never happen.
  • What is the expected behaviour in case a stopped load balancer container exists? I understand that with this PR we are not considering it as an existing load balancer, but if I'm not wrong, by ignoring it we end up trying to create a new one, and this will fail because a container with the same name already exists...

@ashish-amarnath
Contributor Author

@fabriziopandini
I am not entirely sure how I ended up with a stopped LB container for my cluster. But the error message I saw in the controller when this happened wasn't indicative of the reason. Specifically, this is the error I observed:

[manager] E0329 14:51:58.051663      10 controller.go:302] controller-runtime/manager/controller/dockercluster "msg"="Reconciler error" "error"="DockerCluster.infrastructure.cluster.x-k8s.io \"my-cluster\" is invalid: spec.controlPlaneEndpoint.host: Required value" "name"="my-cluster" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerCluster" 

Your concerns are valid. First, I agree that this should not happen commonly. But when it does, at the very least the error needs to be more meaningful than indicating that there was something wrong with a spec that was applied and accepted by validation. Second, if this were a real provider, then in the case of the LB being stopped, killed, or deleted, the expectation would be that a new one would be spun up in its place as part of reconciling the infrastructure that makes up the cluster. That said, I also understand that this is not a real provider for production.
About reconciliation of the LB failing because a container with the same name exists, I think that should be remediated too. This can be done by removing stopped containers with the same name (a rough sketch follows below). Happy to address that in this PR if you agree 🙂
WDYT?
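A rough sketch of the remediation proposed above. This approach was ultimately not adopted (see the follow-up comments); the helper name and the shell-out to docker are illustrative, not CAPD code.

// Rough sketch of the remediation proposed above (not adopted in the end):
// remove exited containers that share the LB name before re-creating the
// load balancer. The helper name is illustrative.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// removeStoppedContainersNamed deletes exited containers whose name matches
// the given load balancer container name.
func removeStoppedContainersNamed(name string) error {
	out, err := exec.Command(
		"docker", "ps", "-a",
		"--filter", "name="+name,
		"--filter", "status=exited",
		"--format", "{{.ID}}",
	).Output()
	if err != nil {
		return err
	}
	for _, id := range strings.Fields(string(out)) {
		if err := exec.Command("docker", "rm", id).Run(); err != nil {
			return fmt.Errorf("removing stopped container %s: %w", id, err)
		}
	}
	return nil
}

func main() {
	fmt.Println(removeStoppedContainersNamed("my-cluster-lb"))
}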

@fabriziopandini
Member

fabriziopandini commented Mar 31, 2021

TBH, I'm a little bit concerned that automatic remediation of stopped containers would hide the root cause of the problem and, in the end, introduce more instability to the system due to the load balancer going away and being recreated in an unpredictable way (also, in a real production system you can't really trust a load balancer that suddenly goes away for no reason).

However, I agree we should report a better error message, but this should be done without ignoring stopped containers.

@sbueringer
Member

sbueringer commented Mar 31, 2021

I agree with @fabriziopandini. Let's first try to improve error reporting/logging. Once we know the root cause, we can decide if automatic remediation is the right way to resolve it.

I opened PR #4414 to gather more data in CI. So if we hit the issue there, it should be easier to find out what leads to this problem.

@ashish-amarnath
Contributor Author

@fabriziopandini Considering that this is not a real provider but one meant to catch problems, I agree that this change would be papering over real issues. I will remove the filtering change and keep the error check, which should give us the meaningful error message we are looking for.

@ashish-amarnath ashish-amarnath changed the title 🐛 [capd] Filter running LB containers for DockerCluster 🐛 [capd] Ensure Loadbalancer IP is not empty Mar 31, 2021
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 31, 2021
@ashish-amarnath ashish-amarnath force-pushed the lb-status-filter branch 3 times, most recently from 282bf6f to 2dfdc99 on March 31, 2021 15:00
@sbueringer
Member

/lgtm

@ashish-amarnath ashish-amarnath changed the title 🐛 [capd] Ensure Loadbalancer IP is not empty 🌱 [capd] Ensure Loadbalancer IP is not empty Mar 31, 2021
@ashish-amarnath
Contributor Author

Realized we are not categorizing this as a bug.

@fabriziopandini
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 31, 2021
@MarcelMue
Contributor

/lgtm

@fabriziopandini
Member

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 2, 2021
@sbueringer
Member

sbueringer commented Apr 2, 2021

/test pull-cluster-api-test-main

EDIT: faster :)

@fabriziopandini
Member

/retest

@k8s-ci-robot k8s-ci-robot merged commit 00b2032 into kubernetes-sigs:master Apr 2, 2021
@k8s-ci-robot k8s-ci-robot added this to the v0.4 milestone Apr 2, 2021
@ashish-amarnath ashish-amarnath deleted the lb-status-filter branch April 2, 2021 23:41
Labels
approved  Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes  Indicates the PR's author has signed the CNCF CLA.
lgtm  "Looks good to me", indicates that a PR is ready to be merged.
size/XS  Denotes a PR that changes 0-9 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

[capd] Reconciliation of DockerClusters fails when there is a stopped ha-proxy container.
5 participants