[BUG] Workflows nodes sometimes remain in "Running" state even when task fails #333

Closed
2 of 20 tasks
kumare3 opened this issue May 29, 2020 · 1 comment
Labels
bug Something isn't working

Comments

@kumare3
Contributor

kumare3 commented May 29, 2020

Describe the bug
Nodes remain in the Running state even when the task and the workflow have failed.

Expected behavior
All nodes should appear in the failed state.

Flyte component

  • Overall
  • Flyte Setup and Installation scripts
  • Flyte Documentation
  • Flyte communication (slack/email etc)
  • FlytePropeller
  • FlyteIDL (Flyte specification language)
  • Flytekit (Python SDK)
  • FlyteAdmin (Control Plane service)
  • FlytePlugins
  • DataCatalog
  • FlyteStdlib (common libraries)
  • FlyteConsole (UI)
  • Other

To Reproduce
Steps to reproduce the behavior:
Run a node with a bad image (image pull failure) and observe that the node stays in the Running state.

Screenshots
[Screenshot: Screen Shot 2020-05-28 at 11 23 37 PM]

Environment
Flyte component

  • Sandbox (local or on one machine)
  • Cloud hosted
    • AWS
    • GCP
    • Azure
  • Baremetal
  • Other

Additional context
NA

@kumare3 kumare3 added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels May 29, 2020
kumare3 pushed a commit to flyteorg/flytepropeller that referenced this issue May 29, 2020
Abort always fails for a task if the task was already in a terminal state (success, failure, or retryable failure), because the event publish fails.
This fix ensures that no event is published for terminal cases.
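
The guard described above can be sketched roughly as follows. This is only an illustration in Python, not the FlytePropeller code: `Phase`, `TERMINAL_PHASES`, `abort_task`, and the `publish_event` callback are hypothetical names chosen for the example, and the set of terminal phases is taken from the commit message.

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical phase names, mirroring the states named in the commit message.
    RUNNING = "running"
    SUCCESS = "success"
    FAILURE = "failure"
    RETRYABLE_FAILURE = "retryable_failure"


# Phases for which an abort should not publish a new event.
TERMINAL_PHASES = {Phase.SUCCESS, Phase.FAILURE, Phase.RETRYABLE_FAILURE}


def abort_task(phase, publish_event):
    """Abort a task; return True if an abort event was published."""
    if phase in TERMINAL_PHASES:
        # Already terminal: skip the publish so the abort itself cannot fail on it.
        return False
    publish_event("aborted")
    return True


published = []
abort_task(Phase.RUNNING, published.append)  # publishes one event
abort_task(Phase.FAILURE, published.append)  # publishes nothing
```

With the guard in place, aborting a task that has already failed becomes a no-op instead of an error, which is what lets the rest of the workflow reach its own failed state.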

 - [x] Bug Fix
 - [ ] Feature
 - [ ] Plugin

 - [x] Code completed
 - [x] Smoke tested
 - [x] Unit tests added
 - [x] Code documentation added
 - [x] Any pending items have an associated Issue

NA

flyteorg/flyte#333

NA
@kumare3 kumare3 removed the untriaged This issues has not yet been looked at by the Maintainers label May 30, 2020
@kumare3 kumare3 self-assigned this May 30, 2020
@kumare3
Contributor Author

kumare3 commented May 30, 2020

This is now merged. It should be part of 0.4.0.

@kumare3 kumare3 closed this as completed May 30, 2020
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022
* Single node GPU training example

Signed-off-by: Ketan Umare <[email protected]>

* Minor fix related to tensorboard in PyTorch (flyteorg#334)

Signed-off-by: Jinserk Baik <[email protected]>

* updated pytorch training example

Signed-off-by: Ketan Umare <[email protected]>

* updated

Signed-off-by: Ketan Umare <[email protected]>

* wandb integration, code lint, content

Signed-off-by: Samhita Alla <[email protected]>

* remove misplaced text

Signed-off-by: Samhita Alla <[email protected]>

* add pytorch in tests' manifest

Signed-off-by: Samhita Alla <[email protected]>

* changed pytorch to mnist

Signed-off-by: Samhita Alla <[email protected]>

* dockerfile

Signed-off-by: Samhita Alla <[email protected]>

* update link

Signed-off-by: cosmicBboy <[email protected]>

* update deps

Signed-off-by: cosmicBboy <[email protected]>

Co-authored-by: Jinserk Baik <[email protected]>
Co-authored-by: Samhita Alla <[email protected]>
Co-authored-by: cosmicBboy <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022
* update pytorch multi-gpu example, incorporate comments @samhita-alla @kumare3

Signed-off-by: Niels Bantilan <[email protected]>

* Apply suggestions from code review

Co-authored-by: Samhita Alla <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>

Co-authored-by: Samhita Alla <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022
Signed-off-by: Flyte-Bot <[email protected]>

Co-authored-by: flyte-bot <[email protected]>
pingsutw pushed a commit to pingsutw/flyte-monorepo that referenced this issue Apr 4, 2023
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Jul 24, 2023
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Aug 21, 2023
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Apr 30, 2024
Signed-off-by: Flyte-Bot <[email protected]>

Co-authored-by: flyte-bot <[email protected]>
austin362667 pushed a commit to austin362667/flyte that referenced this issue May 7, 2024
Signed-off-by: Flyte-Bot <[email protected]>

Co-authored-by: flyte-bot <[email protected]>
robert-ulbrich-mercedes-benz pushed a commit to robert-ulbrich-mercedes-benz/flyte that referenced this issue Jul 2, 2024
Signed-off-by: Flyte-Bot <[email protected]>

Co-authored-by: flyte-bot <[email protected]>
troychiu pushed a commit that referenced this issue Jul 8, 2024
…for the containers (#333)

## Overview
Env vars for Union-injected secrets should appear at the beginning of the env list.

This requirement came from an issue encountered during the NIMs PoC, where the sidecar container required a secret to be passed in under a specific env var name.

The NGC sidecar container requires a secret to be passed in the env var `NGC_API_KEY`.
Since Union-injected secrets use the `_UNION_` prefix, we couldn't define the secret as `NGC_API_KEY` directly, because it would be injected as `_UNION_NGC_API_KEY`.

The `_UNION_` prefix exists to distinguish the secret env vars injected by the webhook.
To leave that functionality unchanged, the proposal is to use dependent environment variables (https://kubernetes.io/docs/tasks/inject-data-application/define-interdependent-environment-variables/),
which allow you to define `NGC_API_KEY` as follows:

`NGC_API_KEY= $(_UNION_NGC_API_KEY)`
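
Why the ordering matters: Kubernetes expands a `$(VAR)` reference only against variables defined earlier in the container's env list, and leaves an unresolved reference unchanged. The snippet below is a simplified Python simulation of that rule, not Kubernetes code. `expand_env` is a hypothetical helper, the literal value `example-key` is made up, and the `$$(VAR)` escape syntax is ignored.

```python
import re

# Matches $(NAME) references; the character class is a simplifying assumption.
_REF = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_.-]*)\)")


def expand_env(env):
    """Expand $(VAR) references in a list of (name, value) pairs, in order.

    A reference resolves only if the variable was defined earlier in the list;
    otherwise the reference text is left unchanged.
    """
    resolved = {}
    expanded_env = []
    for name, value in env:
        expanded = _REF.sub(
            lambda match: resolved.get(match.group(1), match.group(0)), value
        )
        resolved[name] = expanded
        expanded_env.append((name, expanded))
    return expanded_env


# Secret defined first: the dependent variable picks up its value.
secret_first = expand_env([
    ("_UNION_NGC_API_KEY", "example-key"),
    ("NGC_API_KEY", "$(_UNION_NGC_API_KEY)"),
])

# Secret defined last: the reference cannot be resolved and stays literal.
secret_last = expand_env([
    ("NGC_API_KEY", "$(_UNION_NGC_API_KEY)"),
    ("_UNION_NGC_API_KEY", "example-key"),
])
```

In the second case `NGC_API_KEY` would end up holding the literal string `$(_UNION_NGC_API_KEY)`, which is why the injected secrets need to come before any user-defined variables that refer to them.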


The change also removes duplicates if the user tries to define the same env var that Union is injecting.
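
A minimal sketch of the reordering and de-duplication, in Python for illustration only. `merge_env` is a hypothetical helper, not the webhook's actual function, and keeping the injected copy when names collide is an assumption: the description does not say which copy survives.

```python
def merge_env(injected, user_defined):
    """Return one env list with injected secret vars first and no repeated names.

    Both arguments are lists of (name, value) pairs. On a name collision the
    first occurrence wins, which here means the injected entry (an assumption).
    """
    merged = []
    seen = set()
    for name, value in list(injected) + list(user_defined):
        if name in seen:
            continue  # drop the later duplicate
        seen.add(name)
        merged.append((name, value))
    return merged


env = merge_env(
    injected=[("_UNION_MY-CUSTOM-SECRET", "from-webhook")],
    user_defined=[
        ("FLYTE_SECRETS_ENV_PREFIX", "_UNION_"),
        ("_UNION_MY-CUSTOM-SECRET", "from-user"),
    ],
)
```

Here `env` starts with the injected secret, followed by the remaining user-defined variables, and the user's duplicate entry is dropped.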




## Test Plan
Before the change
```
k describe pods -n development  agd92xq6rbhsvn25g7qb
    Environment:
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:using_secrets.main
      FLYTE_INTERNAL_EXECUTION_ID:        agd92xq6rbhsvn25g7qb
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               0
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           using_secrets.fn
      FLYTE_INTERNAL_TASK_VERSION:        zEKw37ArzIKUrfgKOlUHUg
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                using_secrets.fn
      FLYTE_INTERNAL_VERSION:             zEKw37ArzIKUrfgKOlUHUg
      FLYTE_SECRETS_ENV_PREFIX:           _UNION_
      _UNION_MY-CUSTOM-SECRET:            Thisisasecret\r

```

After the change on dogfood-gcp
```
k describe pods -n development  av8hbdjlmf5lzc8gbp5k
    Environment:
      _UNION_MY-CUSTOM-SECRET:            Thisisasecret\r
                                          
      FLYTE_SECRETS_ENV_PREFIX:           _UNION_
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:using_secrets.main
      FLYTE_INTERNAL_EXECUTION_ID:        av8hbdjlmf5lzc8gbp5k
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               0
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           using_secrets.fn
      FLYTE_INTERNAL_TASK_VERSION:        zEKw37ArzIKUrfgKOlUHUg
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                using_secrets.fn
      FLYTE_INTERNAL_VERSION:             zEKw37ArzIKUrfgKOlUHUg
```

Notice the position of `_UNION_MY-CUSTOM-SECRET`. Any Union secrets now show up at the beginning of the list of env vars.

## Rollout Plan (if applicable)
Roll out to staging and then to the demo tenant for the NIMs feature.


## Upstream Changes
Should this change be upstreamed to OSS (flyteorg/flyte)? If not, please uncheck this box, which is used for auditing. Note, it is the responsibility of each developer to actually upstream their changes. See [this guide](https://unionai.atlassian.net/wiki/spaces/ENG/pages/447610883/Flyte+-+Union+Cloud+Development+Runbook/#When-are-versions-updated%3F).

- [ ] To be upstreamed to OSS

## Issue
*TODO: Link Linear issue(s) using [magic words](https://linear.app/docs/github#magic-words). `fixes` will move to merged status, while `ref` will only link the PR.*

## Checklist
* [ ] Added tests
* [ ] Ran a deploy dry run and shared the terraform plan
* [ ] Added logging and metrics
* [ ] Updated [dashboards](https://unionai.grafana.net/dashboards) and [alerts](https://unionai.grafana.net/alerting/list)
* [ ] Updated documentation