Refactor logging to give more visibility to check and heal services #533

Merged
merged 3 commits into from
Dec 1, 2022

ese
Member

@ese ese commented Dec 1, 2022

Description

To diagnose issues with the operator, it is more relevant to know what is happening with check and heal. We were mostly logging the Kubernetes service performing updates to Kubernetes objects, which is usually not relevant once the cluster is bootstrapped.

  • Change some messages to give clearer information
  • Demote Kubernetes object updates to debug level
  • Promote relevant messages in the checker to Info and Warning levels
  • Add namespace and redisfailover context to checker messages
  • Before applying check and heal, wait for all expected pods to be up and running, instead of waiting only for them to exist, to let the Kubernetes controllers do their job
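
To illustrate the last two bullets, here is a minimal Go sketch, not the operator's actual code. The `Pod` type and the functions `allExist`, `allRunning`, `newLogger` and `readyForCheckAndHeal` are hypothetical names made up for this example, and the standard library `log/slog` package stands in for whatever logger the operator really uses. It contrasts the "pods exist" gate with the "pods up and running" gate and attaches namespace and redisfailover context to the checker messages.

```go
package main

import (
	"log/slog"
	"os"
)

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod.
type Pod struct {
	Name    string
	Running bool
}

// allExist is the looser gate: enough pod objects exist, running or not.
func allExist(pods []Pod, expected int) bool {
	return len(pods) >= expected
}

// allRunning is the stricter gate: at least `expected` pods are running.
func allRunning(pods []Pod, expected int) bool {
	running := 0
	for _, p := range pods {
		if p.Running {
			running++
		}
	}
	return running >= expected
}

// newLogger returns a text logger at the default Info level,
// so Debug messages are suppressed unless the level is lowered.
func newLogger() *slog.Logger {
	return slog.New(slog.NewTextHandler(os.Stderr, nil))
}

// readyForCheckAndHeal reports whether check and heal should run,
// logging with namespace and redisfailover context.
func readyForCheckAndHeal(logger *slog.Logger, namespace, name string, pods []Pod, expected int) bool {
	l := logger.With("namespace", namespace, "redisfailover", name)
	// Object-level noise goes to Debug (assumed message text).
	l.Debug("kubernetes objects ensured")
	if !allRunning(pods, expected) {
		l.Warn("waiting for all expected pods to be running", "expected", expected)
		return false
	}
	l.Info("all expected pods running, applying check and heal")
	return true
}
```

With three expected pods of which one is still pending, `allExist` is satisfied but `readyForCheckAndHeal` returns false and emits a warning carrying the namespace and redisfailover name.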

@ese ese requested a review from a team as a code owner December 1, 2022 08:25
@ese ese merged commit 3ceb1b2 into master Dec 1, 2022
@ese ese deleted the logging branch December 1, 2022 15:44
@samof76
Contributor

samof76 commented Dec 5, 2022

@ese there seems to be an inherent issue with this:

Before applying check and heal, wait for all expected pods to be up and running, instead of waiting only for them to exist, to let the Kubernetes controllers do their job

Consider this scenario....

  1. The master and sentinel pods are running
  2. The master pod and the sentinels get killed.
  3. Those pods are unable to get scheduled.

In this case the check-and-heal would not do what it is intended to do.
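
A rough Go sketch of this first scenario, under stated assumptions: three Redis pods and three sentinels (the counts are made up for illustration), and a gate that only lets check-and-heal run when every expected pod is running. `PodPhase`, `allRunning`, `healRuns` and `scenarioOne` are hypothetical names, not the operator's API.

```go
package main

// PodPhase is a simplified pod phase for this sketch.
type PodPhase string

const (
	Pending PodPhase = "Pending"
	Running PodPhase = "Running"
)

// allRunning reports whether every expected pod is Running,
// which is the gate described in the PR.
func allRunning(phases []PodPhase) bool {
	for _, p := range phases {
		if p != Running {
			return false
		}
	}
	return true
}

// healRuns counts how many times check-and-heal would be applied
// over a number of reconcile ticks for an unchanging cluster state.
func healRuns(phases []PodPhase, ticks int) int {
	runs := 0
	for i := 0; i < ticks; i++ {
		if allRunning(phases) {
			runs++
		}
	}
	return runs
}

// scenarioOne models the state after step 3: the master and the
// sentinels were killed and cannot be scheduled again.
func scenarioOne() []PodPhase {
	return []PodPhase{
		Pending, // former master, unschedulable
		Running, // replica
		Running, // replica
		Pending, // sentinel, unschedulable
		Pending, // sentinel, unschedulable
		Pending, // sentinel, unschedulable
	}
}
```

As long as the pods stay unschedulable, `healRuns(scenarioOne(), n)` is zero for any `n`, so no new master is promoted among the replicas that are still running.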

Consider another scenario...

  1. The master and sentinel pods are running
  2. All of the sentinels get killed, along with a slave.
  3. Now the sentinels get scheduled, but one slave is still not scheduled.

In this case the check-and-heal would not configure the sentinels because of the fix here.
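
The second scenario can be sketched the same way. This is an illustrative Go model, assuming three Redis pods and three sentinels with made-up pod names. `Component`, `Pod`, `allRunning`, `componentRunning`, `notRunning` and `scenarioTwo` are hypothetical. It shows that the sentinels are all up while a single pending slave keeps the global gate closed.

```go
package main

// Component distinguishes Redis pods from sentinel pods in this sketch.
type Component string

const (
	Redis    Component = "redis"
	Sentinel Component = "sentinel"
)

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod.
type Pod struct {
	Name      string
	Component Component
	Running   bool
}

// allRunning is the global gate: every expected pod must be running.
func allRunning(pods []Pod) bool {
	return len(notRunning(pods)) == 0
}

// componentRunning reports whether all pods of one component are running.
func componentRunning(pods []Pod, c Component) bool {
	for _, p := range pods {
		if p.Component == c && !p.Running {
			return false
		}
	}
	return true
}

// notRunning lists the pods that keep the global gate closed.
func notRunning(pods []Pod) []string {
	var names []string
	for _, p := range pods {
		if !p.Running {
			names = append(names, p.Name)
		}
	}
	return names
}

// scenarioTwo models the state after step 3: the sentinels are back,
// but one slave is still not scheduled.
func scenarioTwo() []Pod {
	return []Pod{
		{Name: "redis-0", Component: Redis, Running: true}, // master
		{Name: "redis-1", Component: Redis, Running: true}, // slave
		{Name: "redis-2", Component: Redis, Running: false}, // slave, pending
		{Name: "sentinel-0", Component: Sentinel, Running: true},
		{Name: "sentinel-1", Component: Sentinel, Running: true},
		{Name: "sentinel-2", Component: Sentinel, Running: true},
	}
}
```

Here `componentRunning(pods, Sentinel)` is true, yet `allRunning(pods)` is false because of `redis-2`, so under a global gate the freshly started sentinels would not be configured.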

@samof76 samof76 mentioned this pull request Dec 5, 2022