
fix: Change pod readiness check mechanism #249

Merged: 4 commits into lablabs:main on Oct 12, 2024

Conversation

@lukapetrovic-git (Contributor) commented Sep 16, 2024

Description

If needed, I can open an issue for this as well. The following happens:

The check that all pods are ready fails in some of my clusters because grep matches pods that it is not supposed to, for example:

[screenshot: kubectl get pods output where pods with "init" in their names are matched by the check]

In the situation above I have pods whose names contain "init", so the check never passes. I therefore changed the task to inspect the metadata of the pod itself and determine its phase (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase); see the sketch below.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Small minor change not affecting the Ansible Role code (GitHub Actions Workflow, Documentation etc.)

How Has This Been Tested?

Tested on Ubuntu 22.04 with RKE2 v1.27.12+rke2r1, on a dev cluster and on one production cluster where the problem was occurring.

@lukapetrovic-git marked this pull request as ready for review on September 16, 2024, 14:14
@lukapetrovic-git changed the title from "Change pod readiness check mechanism" to "fix: Change pod readiness check mechanism" on Sep 16, 2024
```diff
   args:
     executable: /bin/bash
-  failed_when: "all_pods_ready.rc not in [ 0, 1 ]"
+  failed_when: "all_pods_ready.rc != 0"
```
Contributor Author

I am not sure why return code 1 was considered OK here, so I made this change. If there is something I'm not seeing, please comment. @MonolithProjects

Collaborator

It was like that because if grep found no matches (all pods in Ready state), the command would return 1 and the task would fail. But with your approach it's fine to change this. I guess now you can even remove this line.

Contributor Author
Removed

@lukapetrovic-git (Contributor, Author) commented Sep 27, 2024

Another question I have regarding this task: why are pods running in kube-system exempt from the check (metadata.namespace!=kube-system)?
One example: when the RKE2 service is restarted, in my case the Cilium pods also get restarted; they run in the kube-system namespace and are crucial to the functioning of the cluster as a whole. Cheers!


```diff
   args:
     executable: /bin/bash
-  failed_when: "all_pods_ready.rc not in [ 0, 1 ]"
+  failed_when: "all_pods_ready.rc != 0"
```
Collaborator
This is not needed anymore

Contributor Author

Removed

@MonolithProjects (Collaborator)

> Another question I have regarding this task: why are pods running in kube-system exempt from the check (metadata.namespace!=kube-system)? One example: when the RKE2 service is restarted, in my case the Cilium pods also get restarted; they run in the kube-system namespace and are crucial to the functioning of the cluster as a whole. Cheers!

Hmm, actually this is something I overlooked. It does not make much sense to me to exclude the pods in this namespace from the check.

tasks/change_config.yml (review thread outdated, resolved)
tasks/rolling_restart.yml (review thread outdated, resolved)
@MonolithProjects added the bug label on Oct 6, 2024
@lukapetrovic-git (Contributor, Author)

Thanks for the review, I made the changes that were requested. I don't work much with GitHub, so I'm not sure whether I should resolve the review conversations myself. Cheers!

@MonolithProjects (Collaborator)

> Thanks for the review, I made the changes that were requested. I don't work much with GitHub, so I'm not sure whether I should resolve the review conversations myself. Cheers!

That's fine, I will do it. Thanks!

@MonolithProjects (Collaborator) left a review comment
LGTM

@MonolithProjects merged commit a77f6fd into lablabs:main on Oct 12, 2024
5 checks passed