Sagemaker plugin: Error handling for custom training job #491
Comments
@bnsblue can you please help us understand what the status of this issue is?
@kumare3 There are two parts to this item: capturing the output of the subprocess, and writing the content of errors.pb to /opt/ml/failure. I've already done the first part, capturing the output of the subprocess, which allows Flyte itself to always capture the messages from pyflyte-execute even in the presence of the middle layer. Because this doesn't affect the status shown on Flyte's interface, this item was never prioritized or implemented. I think it would take a bit of work to discuss and think through what experience Flyte really wants the user to have. Once that is thought through, I believe it wouldn't require much engineering effort to make it work. Unfortunately I don't have much chance to think about it deeply at the moment, so I just unassigned myself. Please feel free to assign it to anyone who would like to take the next step.
Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏
Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏
Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
[x] Capture and output subprocess stdout and stderr, and check subprocess execution return code flyteorg/flytekit#185
[ ] Write the content of errors.pb to /opt/ml/failure
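A minimal sketch of how the two items above could fit together, not the plugin's actual implementation. The function name `run_and_report` and its parameters are hypothetical, and the failure path is passed in by the caller (the issue names /opt/ml/failure). Copying errors.pb as raw bytes is also an assumption; what the failure file should actually contain is the open design question raised in the comments.

```python
import subprocess
import sys
from pathlib import Path


def run_and_report(cmd, errors_path, failure_path):
    """Run cmd, forward its output, and on failure write error details to failure_path.

    Hypothetical helper: names and behaviour are illustrative assumptions.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)

    # Forward the captured output so the outer layer's logs still contain it.
    sys.stdout.write(result.stdout)
    sys.stderr.write(result.stderr)

    if result.returncode != 0:
        errors_file = Path(errors_path)
        if errors_file.exists():
            # Assumption: copy the serialized error document as raw bytes.
            payload = errors_file.read_bytes()
        else:
            # Fall back to the captured stderr when no error document exists.
            payload = result.stderr.encode("utf-8")

        failure_file = Path(failure_path)
        failure_file.parent.mkdir(parents=True, exist_ok=True)
        failure_file.write_bytes(payload)

    return result.returncode
```

The caller would wrap the pyflyte-execute command with this helper and exit with the returned code, so that a non-zero status propagates to the training job.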
Background: https://docs.google.com/document/d/118nUo2zbeiKbLnYo7A3bCyIzRuhG-cxqV3AWCJrCfYU/edit#heading=h.xg7j6wu6pw9a