[BUG] Task container terminate with error OOMKilled with exit code 137 #4975

Open
2 tasks done
spacepirate0001 opened this issue Feb 28, 2024 · 6 comments
Assignees
Labels
bug Something isn't working flyteadmin Issue for FlyteAdmin Service helm waiting for reporter Used for when we need input from the bug reporter

Comments

spacepirate0001 commented Feb 28, 2024

Describe the bug

Using inline to override the Flyte configuration for task_resources does not work as expected. I added the following configuration to values.yaml for the flyte-binary chart:

# inline Specify additional configuration or overrides for Flyte, to be merged with the base configuration
  inline:
   task_resources:
    limits:
      cpu: 2
      memory: 3Gi
      ephemeralStorage: 0
      gpu: 0
    defaults:
      cpu: 2
      memory: 3Gi
      ephemeralStorage: 0
      gpu: 0

The added configuration did not change task_resources as expected. The task container YAML showed the following resource settings:

      name: a4dlhcql89l2zsndtzcp-n1-0
    resources:
      limits:
        cpu: "1"
        memory: 2Gi
      requests:
        cpu: "1"
        memory: 2Gi
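
The comment in the values snippet describes inline as being "merged with the base configuration". As a rough illustration of what a recursive merge of that kind would be expected to produce, here is a minimal sketch. It is not Flyte's actual merge code; deep_merge is a hypothetical helper, and the base values are an assumption taken from the container spec observed above.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict with `override` recursively merged onto `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and mismatched types: the override wins.
            merged[key] = deepcopy(value)
    return merged


# Assumed base values (hypothetical, mirroring the observed container spec).
base_config = {
    "task_resources": {
        "defaults": {"cpu": "1", "memory": "2Gi"},
        "limits": {"cpu": "1", "memory": "2Gi"},
    }
}

# The inline override from this report (cpu and memory only).
inline_override = {
    "task_resources": {
        "defaults": {"cpu": "2", "memory": "3Gi"},
        "limits": {"cpu": "2", "memory": "3Gi"},
    }
}

effective = deep_merge(base_config, inline_override)
print(effective["task_resources"])
```

If the override were merged this way, both defaults and limits would end up at 2 CPUs and 3Gi, which is not what the container spec shows.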

Expected behavior

Task container resources should reflect the new configuration added to values.yaml, i.e.

# inline Specify additional configuration or overrides for Flyte, to be merged with the base configuration
  inline:
   task_resources:
    limits:
      cpu: 2
      memory: 3Gi
      ephemeralStorage: 0
      gpu: 0
    defaults:
      cpu: 2
      memory: 3Gi
      ephemeralStorage: 0
      gpu: 0

should make the task container run with the following resources:

      name: a4dlhcql89l2zsndtzcp-n1-0
    resources:
      limits:
        cpu: "2"
        memory: 3Gi
      requests:
        cpu: "1"
        memory: 3Gi

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@spacepirate0001 spacepirate0001 added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Feb 28, 2024

dosubot bot commented Feb 28, 2024

🤖

Hey @spacepirate0001, great to see you back! Ran into another quirky adventure, I see? Let's dive into it.

Thank you for providing detailed information about the issue you're facing. It seems like the inline configuration override for task_resources in the Flyte configuration is not reflecting the expected changes in the task container resources. This could be due to several reasons:

  1. Configuration Not Applied: The changes made to the task_resource_defaults.yaml file might not have been applied to the Flyte cluster. Ensure that after modifying the configuration file, you apply the changes using the appropriate command, such as kubectl apply -k kustomize/base/single_cluster/headless/ if you are using Kubernetes.

  2. Configuration File Location: The configuration file must be in the correct location and correctly referenced by the FlyteAdmin service. Ensure that the file path kustomize/base/single_cluster/headless/config/admin/task_resource_defaults.yaml is correctly mounted into the FlyteAdmin pod and that FlyteAdmin is configured to read from this file.

  3. FlyteAdmin Service Restart: After updating the configuration file, the FlyteAdmin service may need to be restarted for the changes to take effect. This is because configuration files are often read at the service startup, and changes made while the service is running are not automatically picked up.

  4. Syntax or Indentation Error: YAML files are sensitive to syntax and indentation. An error or misalignment in the task_resource_defaults.yaml file could cause the configuration to be ignored or not parsed correctly. Ensure that the YAML syntax is correct and follows the required structure.

  5. Overrides at Task Registration: Task definitions can specify their own resource requirements, which override the defaults specified in task_resource_defaults.yaml. If the tasks have their resource requirements set at the time of registration, those values will take precedence over the defaults.

  6. Flyte Version Compatibility: Ensure that the version of Flyte you are using supports the configuration overrides as expected. It's possible that the feature behaves differently or has been updated in different versions of Flyte.

To diagnose and fix the issue, you should:

  • Verify that the configuration changes have been applied correctly.
  • Ensure that the FlyteAdmin service is correctly configured to use the updated configuration file and restart the service if necessary.
  • Check the syntax and indentation of the task_resource_defaults.yaml file.
  • Review the task definitions to ensure they are not specifying their own resource requirements that override the defaults.
  • Confirm that the Flyte version you are using supports the configuration overrides as expected.

I also found a similar solved issue [Docs] Document how to change default platform conf when setting resources for task which might be helpful.

If you're trying to override the resources allocated to tasks at runtime, you can use the with_overrides method to dynamically override the resources allocated to tasks. This is demonstrated in the provided example where the with_overrides method is used to change the resource limits of a task within a workflow. Here's how it's done:

  1. Define your task normally with initial resource requests and limits using the @task decorator and specifying the requests and limits parameters.
  2. In your workflow, when calling the task, use the .with_overrides() method on the task call to specify new resource limits.
  3. Pass the Resources object with the desired cpu and mem overrides to the limits parameter of the with_overrides method.

Here's the relevant code snippet from the context:

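import typing
from flytekit import Resources, workflow
# square_1 and count_unique_numbers_1 are @task functions defined elsewhere.
@workflow
def my_pipeline(x: typing.List[int]) -> int:
    return square_1(x=count_unique_numbers_1(x=x)).with_overrides(limits=Resources(cpu="6", mem="500Mi"))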

In this example, the with_overrides method is used to override the resource limits of the square_1 task call within the my_pipeline workflow, setting the CPU limit to "6" and the memory limit to "500Mi".

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@pingsutw pingsutw added flyteadmin Issue for FlyteAdmin Service helm and removed untriaged This issues has not yet been looked at by the Maintainers labels Feb 29, 2024
@eapolinario eapolinario self-assigned this Feb 29, 2024
@eapolinario
Contributor

@spacepirate0001, how did you update the values in the task resources?

@eapolinario eapolinario added the waiting for reporter Used for when we need input from the bug reporter label Mar 2, 2024
@spacepirate0001
Author

> @spacepirate0001, how did you update the values in the task resources?

I first tried to add it to clusterResourceTemplates as follows:

    013_flyte_task_resource_defaults.yaml: |
      task_resources:
        defaults:
          cpu: 2
          memory: 3Gi
          ephemeralStorage: 0
          gpu: 0
        limits:
          cpu: 2
          memory: 3Gi
          ephemeralStorage: 0
          gpu: 0  

That did not work, so I then added it under configuration.inline:

#inline Specify additional configuration or overrides for Flyte, to be merged with the base configuration
  inline:  
    task_resources:
      limits:
        cpu: 2
        memory: 3Gi
        ephemeralStorage: 0
        gpu: 0
      defaults:
        cpu: 2
        memory: 3Gi
        ephemeralStorage: 0
        gpu: 0

This worked only partially: new task containers run with the following resources:

resources:
      limits:
        cpu: "1"
        memory: 2Gi
      requests:
        cpu: "1"
        memory: 2Gi

My tasks need more resources than that, so they end up failing with the error OOMKilled and exit code 137.
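
For context on the exit code: 137 follows the common convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL), which is what a container receives when it is OOM-killed. A minimal sketch of that arithmetic follows; describe_exit_code is a hypothetical helper, not part of Flyte or Kubernetes.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the platform's name for this signal number.
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


print(describe_exit_code(137))
```

On Linux, 137 - 128 = 9 maps to SIGKILL, which matches a container killed for exceeding its memory limit.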

@eapolinario
Contributor

@spacepirate0001, how are you updating the values? Can you share the commands you used? Also, can you confirm which requests and limits values, if any, you were using in the tasks you tested on?

@spacepirate0001
Author

The code is deployed via Terraform module updates, and I can see the changes I make reflected in the manifest. You should try the same in your setup; you will see that the values don't change beyond what I've mentioned. Finally, I'm running the flyte-binary chart, in which I did not find any values for task_resources at all.

@cjidboon94
Contributor

cjidboon94 commented Mar 20, 2024

In my flyte-binary values I've set the following, as suggested here: https://github.com/davidmirror-ops/flyte-the-hard-way/blob/main/docs/aws/05-deploy-with-helm.md#time-for-helm

configuration:
  inline:
    task_resources:
      defaults:
        cpu: 500m
        memory: 500Mi
        storage: 500Mi
      limits:
        cpu: "10"
        memory: 20Gi

However, when I try to register a workflow that includes a task with a limit of cpu=4, I get the following error response:

USER:BadInputToAPI: error=None, cause=<_InactiveRpcError of RPC that terminated with:
        status = StatusCode.INVALID_ARGUMENT
        details = "Requested CPU limit [4] is greater than current limit set in the platform configuration [2]. Please contact Flyte Admins to change these limits or consult the configuration"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Requested CPU limit [4] is greater than current limit set in the platform configuration [2]. Please contact Flyte Admins to change these limits or consult the configuration", grpc_status:3, created_time:"2024-03-20T11:29:01.232266149+01:00"}"
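
The error above amounts to comparing the requested CPU limit against the platform limit currently in effect. Below is a minimal sketch of such a check, assuming Kubernetes-style CPU quantities ("500m", "2"). parse_cpu and check_cpu_limit are hypothetical helpers written for illustration and are not FlyteAdmin's implementation.

```python
from fractions import Fraction


def parse_cpu(quantity) -> Fraction:
    """Convert a Kubernetes-style CPU quantity ("500m", "2", 4) to cores."""
    text = str(quantity).strip()
    if text.endswith("m"):
        # The "m" suffix means millicores: 500m == 0.5 cores.
        return Fraction(int(text[:-1]), 1000)
    return Fraction(text)


def check_cpu_limit(requested, platform_limit) -> None:
    """Raise ValueError if the requested limit exceeds the platform limit."""
    if parse_cpu(requested) > parse_cpu(platform_limit):
        raise ValueError(
            f"Requested CPU limit [{requested}] is greater than current limit "
            f"set in the platform configuration [{platform_limit}]"
        )


# Against the limit from configuration.inline ("10"), cpu=4 is accepted.
check_cpu_limit("4", "10")

# Against the limit reported in the error ("2"), it is rejected.
try:
    check_cpu_limit("4", "2")
except ValueError as exc:
    print(exc)
```

The request would pass against the configured limit of 10 and only fails against a limit of 2, which suggests that a different value is in effect for this project/domain.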

If I run flytectl get task-resource-attribute -p flytesnacks -d development, I get
{"project":"flytesnacks","domain":"development","defaults":{"cpu":"1","memory":"150Mi"},"limits":{"cpu":"2","memory":"2Gi"}} as a response, which doesn't seem to match my values but rather the default values.


EDIT: my issue was that I had previously set task-resource-attribute via pyflyte update task-resource-attribute for the project/domain. Deleting that allowed Flyte to pick up the default task resources.
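
The behaviour described in the edit can be pictured as a precedence rule: a task-resource-attribute stored for a project/domain is used in place of the platform-wide configuration, and the platform values only apply once that attribute is removed. The sketch below assumes the stored attribute replaces the platform values wholesale, which is consistent with what was observed here but is an assumption about Flyte's internals. effective_task_resources is a hypothetical helper.

```python
import json

# Platform-wide values from configuration.inline, as posted above.
platform_config = {
    "defaults": {"cpu": "500m", "memory": "500Mi", "storage": "500Mi"},
    "limits": {"cpu": "10", "memory": "20Gi"},
}

# Response of `flytectl get task-resource-attribute -p flytesnacks -d development`.
stored_attribute = json.loads(
    '{"project":"flytesnacks","domain":"development",'
    '"defaults":{"cpu":"1","memory":"150Mi"},'
    '"limits":{"cpu":"2","memory":"2Gi"}}'
)


def effective_task_resources(platform, override=None):
    """Use the project/domain attribute when one exists, else the platform config."""
    source = platform if override is None else override
    return {
        "defaults": dict(source.get("defaults", {})),
        "limits": dict(source.get("limits", {})),
    }


with_override = effective_task_resources(platform_config, stored_attribute)
after_delete = effective_task_resources(platform_config, None)

print(with_override["limits"])  # values from the stored attribute
print(after_delete["limits"])   # values from configuration.inline
```

With the stored attribute present, the CPU limit in effect is 2. Without it, the limit of 10 from the chart values applies.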

4 participants