The Agent/Gateway healthy condition is more reliable #1236

Closed
a-thaler opened this issue Jul 2, 2024 · 2 comments
Labels: area/manager (Manager or module changes), kind/feature (Categorizes issue or PR as related to a new feature.)

Comments

@a-thaler
Collaborator

a-thaler commented Jul 2, 2024

Description
As part of #728, the user should be able to define alerts on the Kyma module status. At the moment, there are situations where the telemetry module state is "unhealthy" by design although nothing is actually unhealthy. This happens mainly during upgrade procedures and node evictions, where pods are replaced gracefully. The rollout is non-disruptive, but the module state indicates a disruption, and an alert is fired.

The goal is to handle such situations differently so that the module becomes "unhealthy" only in problematic situations where the user should react.

To achieve this, the gateway/agentHealthy condition needs to be improved: instead of only checking whether the pods are running, it should check for known unhealthy conditions.

Criteria

  • A regular update of the module should not result in a module state change
  • A regular pod rescheduling, caused for example by a node upgrade, should not result in a module state change
  • When the first pipeline resource is defined, the state should stay healthy until the agents are either up or in a bad state
  • If an agent cannot come up (startup error, OOM, waiting too long for a volume mount), a problem is indicated

Hints
The manager should no longer compare desired and available replicas. Instead, it should check whether a minimum number of pods is available and whether all pods are in a healthy state. We might additionally have to watch the pods of the components in order to react to pod status changes.
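
The check described in the hints could look roughly like the following. This is a minimal sketch in Python for brevity, not the manager's actual code. The `Pod` and `ContainerState` types, the sets of "bad" reasons, the `MAX_CREATING_SECONDS` limit, and the `min_available` default are all assumptions made for illustration; the issue only names startup errors, OOM, and long waits for a volume mount as examples.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: which Kubernetes container reasons count as "known bad".
BAD_WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}
BAD_TERMINATED_REASONS = {"OOMKilled", "Error"}
# Hypothetical limit for "waiting for volume mount for too long".
MAX_CREATING_SECONDS = 300


@dataclass
class ContainerState:
    ready: bool = False
    waiting_reason: Optional[str] = None
    terminated_reason: Optional[str] = None


@dataclass
class Pod:
    name: str
    age_seconds: int = 0
    containers: List[ContainerState] = field(default_factory=list)


def bad_state(pod: Pod) -> Optional[str]:
    """Return a reason if the pod is in a known bad state, else None."""
    for c in pod.containers:
        if c.terminated_reason in BAD_TERMINATED_REASONS:
            return c.terminated_reason
        if c.waiting_reason in BAD_WAITING_REASONS:
            return c.waiting_reason
        stuck = c.waiting_reason == "ContainerCreating"
        if stuck and pod.age_seconds > MAX_CREATING_SECONDS:
            return "ContainerCreatingTimeout"
    return None


def is_ready(pod: Pod) -> bool:
    """A pod counts as available when all of its containers are ready."""
    return bool(pod.containers) and all(c.ready for c in pod.containers)


def workload_ok(pods: List[Pod], min_available: int = 1) -> bool:
    """True if no pod is in a known bad state and enough pods are ready."""
    if any(bad_state(p) is not None for p in pods):
        return False
    return sum(is_ready(p) for p in pods) >= min_available
```

With this shape, a pod that is still starting after a node upgrade does not count against the workload as long as `min_available` pods are ready, while a single OOM-killed pod does.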

@a-thaler a-thaler added kind/feature Categorizes issue or PR as related to a new feature. area/manager Manager or module changes labels Jul 2, 2024
@rakesh-garimella rakesh-garimella self-assigned this Jul 8, 2024
@a-thaler
Collaborator Author

We agreed to give the user better transparency by introducing a dedicated reason for the situation where not all pods are ready. As we interpret that situation as "healthy", the value will reflect a positive situation.

Agreed status:

  • All desired pods are running -> healthy, 1
  • Some pods are not ready but are in no known bad state -> rolloutInProgress, 1
  • A pod is in a known bad state or not scheduled at all -> unhealthy, 0
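
The agreed mapping can be written down as a small decision function. The following is only an illustrative Python sketch: the `PodState` enum and the `module_status` name are hypothetical, and the real implementation in the manager may be structured differently. Only the three reasons and their values are taken from the list above.

```python
from enum import Enum
from typing import List, Tuple


class PodState(Enum):
    # Hypothetical per-pod classification used only for this sketch.
    READY = "ready"
    NOT_READY = "not ready"        # e.g. starting up or being rescheduled
    BAD = "bad"                    # known bad state, e.g. crash loop or OOM
    UNSCHEDULED = "unscheduled"    # not scheduled at all


def module_status(pods: List[PodState]) -> Tuple[str, int]:
    """Map pod states to (reason, value) as in the agreed status list."""
    # A single bad or unscheduled pod makes the whole component unhealthy.
    if any(p in (PodState.BAD, PodState.UNSCHEDULED) for p in pods):
        return ("unhealthy", 0)
    # Not-ready pods without a known problem are treated as a rollout.
    if any(p is PodState.NOT_READY for p in pods):
        return ("rolloutInProgress", 1)
    return ("healthy", 1)
```

The bad-state check comes first on purpose: a failing pod should be reported even while other pods are still rolling out. How an empty pod list should be treated is not covered by the agreement, so this sketch does not define it deliberately.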

@rakesh-garimella
Contributor

The status was implemented as in the comment above.
