LogPipeline AgentHealthy status is flaky at agent rollout #1545

a-thaler · 2024-10-22T07:51:58Z

Description

I observed a Fluent Bit rollout that caused an AgentNotReady reason for a single datapoint only. Before that, there was one datapoint of RolloutInProgress. Because the rollout status lasted for only one datapoint, I assume there was no timeout scenario.

The related log in the manager:

{"level":"ERROR","timestamp":"2024-10-21T04:39:31Z","caller":"commonstatus/checker.go:76","message":"Failed to probe agent - set condition as not healthy","controller":"logpipeline","controllerGroup":"telemetry.kyma-project.io","controllerKind":"LogPipeline","LogPipeline":{"name":"cls"},"namespace":"","name":"cls","reconcileID":"5f882c60-d56f-4113-8986-c4a111b0ca54","error":"Pod has failed: "}

Usually the manager reads the status.message field, and it looks like that field was unset. Otherwise we would have seen some text after "Pod has failed: " in the error above.

Feedback from @rakesh-garimella:
The API has two fields that can be set, Reason and Message; both are optional. Until now I was only using Message. I will also add Reason here to get more information. Maybe that one is set.

[types.go](https://github.com/kubernetes/api/blob/master/core/v1/types.go)
    Message string `json:"message,omitempty" protobuf:"bytes,3,opt,name=message"`
    // A brief CamelCase message indicating details about why the pod is in this state.
    // e.g. 'Evicted'
    // +optional
    Reason string `json:"reason,omitempty" protobuf:"bytes,4,opt,name=reason"`

Printing all conditions would probably also be useful. I need to think about how to incorporate this into the code.
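A minimal sketch of the idea discussed above: build the error text from Reason, Message, and the pod conditions, so that it stays informative when Message is empty. This is not the telemetry-manager code. The types `podStatus` and `podCondition` are simplified, hypothetical stand-ins for the `k8s.io/api/core/v1` types, and `describeFailure` is an assumed helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// podCondition and podStatus are simplified, hypothetical stand-ins for the
// corresponding types in k8s.io/api/core/v1. Only the fields used here are kept.
type podCondition struct {
	Type   string
	Status string
	Reason string // optional
}

type podStatus struct {
	Reason     string // optional, e.g. "Evicted"
	Message    string // optional
	Conditions []podCondition
}

// describeFailure is a hypothetical helper. It collects every piece of
// information that is set, instead of relying on Message alone.
func describeFailure(s podStatus) string {
	var parts []string
	if s.Reason != "" {
		parts = append(parts, "reason="+s.Reason)
	}
	if s.Message != "" {
		parts = append(parts, "message="+s.Message)
	}
	for _, c := range s.Conditions {
		entry := fmt.Sprintf("condition %s=%s", c.Type, c.Status)
		if c.Reason != "" {
			entry += " (" + c.Reason + ")"
		}
		parts = append(parts, entry)
	}
	if len(parts) == 0 {
		// Make the "nothing was reported" case explicit in the log.
		return "Pod has failed: no reason, message, or conditions reported"
	}
	return "Pod has failed: " + strings.Join(parts, ", ")
}

func main() {
	fmt.Println(describeFailure(podStatus{
		Reason:     "Evicted",
		Conditions: []podCondition{{Type: "Ready", Status: "False"}},
	}))
}
```

With such a helper, a pod whose Message is unset would no longer produce a bare "Pod has failed: " entry in the manager log.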

Expected result

Actual result

Steps to reproduce

Troubleshooting

Release Notes


skhalash · 2024-10-25T11:24:34Z

The problem is fixed, but we need to add an E2E test to make sure it won't happen in the future: #1566

a-thaler added the kind/bug and area/logs labels Oct 22, 2024

skhalash self-assigned this Oct 22, 2024

This was referenced Oct 23, 2024

fix: Implement graceful shutdown kyma-project/directory-size-exporter#79

Merged

chore: Bump directory-size-exporter image #1558

Merged

Add an E2E testing ensuring that rolling upgrade does not make pipelines unhealthy #1566

Closed

skhalash closed this as completed Oct 25, 2024

a-thaler added this to the 1.27.0 milestone Oct 25, 2024
