The Agent/Gateway healthy condition is more reliable #1236

a-thaler · 2024-07-02T13:21:52Z

Description
As part of #728, the user should be able to define alerts on the Kyma module status. At the moment, there are situations where the telemetry module state is reported as "unhealthy" by design although nothing is actually unhealthy. This happens mainly during upgrade procedures and node evictions, where pods are replaced gracefully. The rollout is non-disruptive, but the module state indicates a disruption, and an alert is fired.

The goal is to handle these situations differently so that the module becomes "unhealthy" only in problematic situations where the user should react.

To achieve this, the gateway/agentHealthy condition needs to be improved: instead of only checking whether the pods are running, it should check for unhealthy conditions.

Criteria

A regular update of the module should not result in a module state change
A regular pod rescheduling, caused for example by a node upgrade, should not result in a module state change
When the first pipeline resource is defined, the state should stay healthy until the agents are up or are in a bad state
If an agent cannot come up (startup error, OOM, waiting for volume mount for too long), a problem gets indicated

Hints
The manager should no longer compare desired vs. available replicas. Instead, it should check whether a minimum number of pods is available and whether all pods are in a healthy state. We might additionally have to watch the pods of the components in order to react to pod status changes.

a-thaler · 2024-07-22T11:57:42Z

We agreed to give the user better transparency by introducing a dedicated reason for the situation where not all pods are ready. As we interpret that situation as "healthy", the value will reflect a positive situation.

Agreed status:

All desired pods are running -> healthy, 1
Some pods are not ready but have no known bad state -> rolloutInProgress, 1
A pod is in a known bad state or not scheduled at all -> unhealthy, 0
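
The agreed mapping can be expressed as a small decision function. This is a minimal sketch under assumptions: the `Status` type, the `Evaluate` function, and its parameters are hypothetical names, and the reason strings are taken from the list above rather than from the actual implementation.

```go
package main

import "fmt"

// Status pairs the reason with the numeric value from the agreed list
// (1 = positive situation, 0 = problem).
type Status struct {
	Reason string
	Value  int
}

// Evaluate maps the observed pod situation to a status.
// badOrUnscheduled is true if any pod is in a known bad state
// or is not scheduled at all; it takes precedence over readiness.
func Evaluate(desired, ready int, badOrUnscheduled bool) Status {
	switch {
	case badOrUnscheduled:
		return Status{Reason: "unhealthy", Value: 0}
	case ready < desired:
		// Not all pods are ready, but none is known to be broken:
		// treated as a rollout and therefore still a positive situation.
		return Status{Reason: "rolloutInProgress", Value: 1}
	default:
		return Status{Reason: "healthy", Value: 1}
	}
}

func main() {
	fmt.Println(Evaluate(3, 3, false))
	fmt.Println(Evaluate(3, 2, false))
	fmt.Println(Evaluate(3, 2, true))
}
```

Checking the bad state first means a single crashing pod marks the component unhealthy even while other pods are still rolling out.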

rakesh-garimella · 2024-08-05T09:40:19Z

The status was implemented as in the comment above.

a-thaler added the kind/feature and area/manager labels Jul 2, 2024

a-thaler mentioned this issue Jul 1, 2024

Telemetry module status as metric input to enable dashboarding and alerting on it #728

Closed

15 tasks

rakesh-garimella self-assigned this Jul 8, 2024

rakesh-garimella mentioned this issue Jul 17, 2024

feat: Improve agent/gateway status detection #1275

Merged

5 tasks

rakesh-garimella closed this as completed Aug 5, 2024

rakesh-garimella mentioned this issue Aug 5, 2024

fix: Add a new reason for the pipelines #1322

Merged

5 tasks

skhalash mentioned this issue Aug 6, 2024

chore: Move from mocks to stubs #1324

Merged

5 tasks

a-thaler added this to the 1.22.0 milestone Aug 19, 2024

skhalash mentioned this issue Oct 25, 2024

Add an E2E testing ensuring that rolling upgrade does not make pipelines unhealthy #1566

Open
