Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sagemaker plugin: Error handling for custom training job #491

Open
bnsblue opened this issue Aug 31, 2020 · 5 comments
Open

Sagemaker plugin: Error handling for custom training job #491

bnsblue opened this issue Aug 31, 2020 · 5 comments
Labels
plugins Plugins related labels (backend or frontend) stale

Comments

@bnsblue
Copy link
Contributor

bnsblue commented Aug 31, 2020

[x] Capture and output subprocess stdout and stderr, and check subprocess execution return code flyteorg/flytekit#185

[ ] Write the content of errors.pb to /opt/ml/failure
Background: https://docs.google.com/document/d/118nUo2zbeiKbLnYo7A3bCyIzRuhG-cxqV3AWCJrCfYU/edit#heading=h.xg7j6wu6pw9a

@bnsblue bnsblue assigned bnsblue and unassigned bnsblue Sep 2, 2020
@bnsblue bnsblue added this to the 0.8.0 milestone Sep 2, 2020
@anandswaminathan anandswaminathan modified the milestones: 0.8.0, 0.9.0 Sep 30, 2020
@EngHabu EngHabu modified the milestones: 0.9.0, 0.10.0 Nov 4, 2020
@bnsblue bnsblue changed the title Basic error handling for custom training job Error handling for custom training job Dec 12, 2020
@EngHabu EngHabu removed this from the 0.10.0 milestone Jan 11, 2021
@kumare3 kumare3 added this to the 0.12.0 milestone Mar 15, 2021
@kumare3
Copy link
Contributor

kumare3 commented Mar 15, 2021

@bnsblue can you please help us understand what the status of this issue it?

@bnsblue bnsblue removed their assignment Mar 15, 2021
@bnsblue
Copy link
Contributor Author

bnsblue commented Mar 15, 2021

@kumare3 There are two parts in this item: capturing the output of the subprocess and outputing the content of the errors.pb to /opt/ml/failure.

I've already done capturing the output of the subprocess, which allows Flyte itself to always be able to capture the messages from the pyflyte-execute even with the presence of the middle layer flyte_sagemaker_runner.py. However, my experience was that SM itself does not always know whether the training succeeded or not if /opt/ml/failure is empty. Outputting the content of errors.pb to /opt/ml/failure is just a measure to let SageMaker display that the training job fails in case the SM doesn't capture the true failure state.

Due to the fact that this doesn't affect the status shown on Flyte's interface, this item was never prioritized nor implemented.

I think it would require a bit work to discuss and think through what experience Flyte really wants the user to have. Once it is thought through, I believe it wouldn't require too much engineering effort to make it work. Unfortunately I don't have much chance to give it a deep thinking at the moment so I just unassigned myself. Please feel free to assign to anyone who would like to take any step.

@EngHabu EngHabu modified the milestones: 0.12.0, 0.13.0 Apr 1, 2021
@kumare3 kumare3 removed this from the 0.13.0 milestone May 5, 2021
@kumare3 kumare3 added the plugins Plugins related labels (backend or frontend) label Oct 14, 2021
@kumare3 kumare3 changed the title Error handling for custom training job Sagemaker plugin: Error handling for custom training job Oct 14, 2021
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 6, 2022
Before:
A hardcoded string was used for setting the secret namespace

After:
The value for the secret namespace for settings is grabbed dynamically.

Signed-off-by: Francisco J. Solis <[email protected]>

Signed-off-by: Francisco J. Solis <[email protected]>
Co-authored-by: Dan Rammer <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 6, 2022
* Update config.go

Set the default values to 0

Signed-off-by: LN <[email protected]>
Signed-off-by: Ln11211 <[email protected]>

* disable k8s controller-runtime manager metrics server (flyteorg#492)

* setting MetricsBindAddress to 0 to disable controller-runtime manager metrics server

Signed-off-by: Daniel Rammer <[email protected]>

* and now in the webhook

Signed-off-by: Daniel Rammer <[email protected]>

Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Ln11211 <[email protected]>

* fix: Add servicename in certs (flyteorg#491)

Before:
A hardcoded string was used for setting the secret namespace

After:
The value for the secret namespace for settings is grabbed dynamically.

Signed-off-by: Francisco J. Solis <[email protected]>

Signed-off-by: Francisco J. Solis <[email protected]>
Co-authored-by: Dan Rammer <[email protected]>
Signed-off-by: Ln11211 <[email protected]>

* Update config.go

Removed DefaultDeadlines

Signed-off-by: Ln11211 <[email protected]>

Signed-off-by: LN <[email protected]>
Signed-off-by: Ln11211 <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Francisco J. Solis <[email protected]>
Co-authored-by: Dan Rammer <[email protected]>
Co-authored-by: Francisco J. Solis <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022
* expose and use kubeclient configs if available

Signed-off-by: Babis Kiosidis <[email protected]>

* omit empty kubeclientconfig

Signed-off-by: Babis Kiosidis <[email protected]>

* setting configuration on all kubeclients

Signed-off-by: Daniel Rammer <[email protected]>

* addressing PR renaming comments

Signed-off-by: Dan Rammer <[email protected]>

Signed-off-by: Babis Kiosidis <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Dan Rammer <[email protected]>
Co-authored-by: Babis Kiosidis <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022
* Updated dataclass example

Signed-off-by: Kevin Su <[email protected]>

* Fixed tests

Signed-off-by: Kevin Su <[email protected]>

* Fixed tests

Signed-off-by: Kevin Su <[email protected]>

* Updated example

Signed-off-by: Kevin Su <[email protected]>

* Update flytekit and comment

Signed-off-by: Kevin Su <[email protected]>

* add text

Signed-off-by: Samhita Alla <[email protected]>

* Update dependency

Signed-off-by: Kevin Su <[email protected]>

Co-authored-by: Samhita Alla <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Aug 9, 2023
Before:
A hardcoded string was used for setting the secret namespace

After:
The value for the secret namespace for settings is grabbed dynamically.

Signed-off-by: Francisco J. Solis <[email protected]>

Signed-off-by: Francisco J. Solis <[email protected]>
Co-authored-by: Dan Rammer <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Aug 9, 2023
* Update config.go

Set the default values to 0

Signed-off-by: LN <[email protected]>
Signed-off-by: Ln11211 <[email protected]>

* disable k8s controller-runtime manager metrics server (flyteorg#492)

* setting MetricsBindAddress to 0 to disable controller-runtime manager metrics server

Signed-off-by: Daniel Rammer <[email protected]>

* and now in the webhook

Signed-off-by: Daniel Rammer <[email protected]>

Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Ln11211 <[email protected]>

* fix: Add servicename in certs (flyteorg#491)

Before:
A hardcoded string was used for setting the secret namespace

After:
The value for the secret namespace for settings is grabbed dynamically.

Signed-off-by: Francisco J. Solis <[email protected]>

Signed-off-by: Francisco J. Solis <[email protected]>
Co-authored-by: Dan Rammer <[email protected]>
Signed-off-by: Ln11211 <[email protected]>

* Update config.go

Removed DefaultDeadlines

Signed-off-by: Ln11211 <[email protected]>

Signed-off-by: LN <[email protected]>
Signed-off-by: Ln11211 <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Francisco J. Solis <[email protected]>
Co-authored-by: Dan Rammer <[email protected]>
Co-authored-by: Francisco J. Solis <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Aug 21, 2023
* expose and use kubeclient configs if available

Signed-off-by: Babis Kiosidis <[email protected]>

* omit empty kubeclientconfig

Signed-off-by: Babis Kiosidis <[email protected]>

* setting configuration on all kubeclients

Signed-off-by: Daniel Rammer <[email protected]>

* addressing PR renaming comments

Signed-off-by: Dan Rammer <[email protected]>

Signed-off-by: Babis Kiosidis <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Dan Rammer <[email protected]>
Co-authored-by: Babis Kiosidis <[email protected]>
@github-actions
Copy link

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot added the stale label Aug 26, 2023
@github-actions
Copy link

github-actions bot commented Sep 2, 2023

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 2, 2023
@eapolinario eapolinario reopened this Nov 2, 2023
@github-actions github-actions bot removed the stale label Nov 3, 2023
Copy link

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot added the stale label Jul 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plugins Plugins related labels (backend or frontend) stale
Projects
None yet
Development

No branches or pull requests

5 participants