[BUG] flytepropeller fails trying to get pod resource using the kubeClient #4730

andresgomezfrr · 2024-01-16T11:26:19Z

Describe the bug

Yesterday we upgraded flytepropeller to version 1.10.6, and today we noticed many pods stuck in the Terminating state in our cluster. After some checks, we found that the following warning had started to appear frequently in the logs:

flyte/flytepropeller/pkg/controller/nodes/task/k8s/plugin_manager.go, line 269 in 38d1833:

    logger.Warningf(ctx, "Failed to find the Resource with name: %v. Error: %v", nsName, err)

Before the upgrade we did not see any logs like this.

We deleted all the pods manually and downgraded the version. The logs disappeared, and pods are now deleted again without issues.

Expected behavior

The pods should be deleted properly and not get stuck in the Terminating state.

Additional context to reproduce

Upgrade to the latest version of the propeller and execute pods.

Screenshots

No response

Are you sure this issue hasn't been raised already?

Yes

Have you read the Code of Conduct?

Yes


dosubot · 2024-01-16T11:26:43Z

🤖

Hello @andresgomezfrr! I'm Dosu, a friendly bot here to assist you with your Flyte issues and queries. I'm here to help you while we wait for a human maintainer to join us. Don't worry, together we'll figure this out!

Thank you for reporting this issue. It seems like you're experiencing a problem similar to a previously resolved issue in the Flyte repository: [BUG] Pods stuck on Terminating with finalizer. This issue was resolved by a pull request: Abort subworkflow on subnode failure.

However, since you're using a newer version of the propeller (1.10.6), it's possible that the problem has resurfaced due to changes made in this version. Here are some changes that were made in the propeller version 1.10.6 that could potentially cause the pods to get stuck in the Terminating state:

"Moved controller-runtime start out of webhook Run function" - PR#546
"Fixing recovering of SKIPPED nodes" - PR#551
"Remove resource injection on the node for container task" - PR#544
"Remove BarrierTick" - PR#545
"Check for TerminateExecution error and eat Precondition status" - PR#553
"Setting primaryContainerName by default on Pod plugin" - PR#555
"Fast fail if task resource requests exceed k8s resource limits" - PR#488

These are potential causes and further investigation would be needed to confirm the exact cause.

As for the error message you're seeing in the logs, it's a generic error message that's logged when the propeller fails to get the pod resource using the kubeClient. The actual error details are provided by the "err" variable in the Go code, which is not visible in the log message you provided.

To help us investigate this issue further, could you please provide more details about your environment? Specifically, it would be helpful to know:

The Kubernetes version you're using
The configuration of your Flyte deployment
Any customizations you've made to the default Flyte configuration
The exact error message(s) you're seeing in the logs, including the values of the "err" variable and other context-specific variables

This information will help us understand the context in which the issue is occurring and potentially identify the cause. Thank you for your cooperation.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

andresgomezfrr added the bug and untriaged labels on Jan 16, 2024

hamersaw added the exo and backlogged labels and removed the untriaged label on Jan 16, 2024

hamersaw mentioned this issue on Jan 17, 2024, in "Reintroduce k8s client fallback to cache lookups" #4733 (merged)

hamersaw closed this as completed in #4733 Jan 17, 2024

hamersaw mentioned this issue on Jan 22, 2024, in "Wrapping k8s client with write filter and cache reader" #4752 (merged)
