Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Enable custom training job in SageMaker plugin #113

Merged
merged 48 commits into from
Sep 3, 2020

Conversation

bnsblue
Copy link
Contributor

@bnsblue bnsblue commented Aug 13, 2020

TL;DR

This PR adds a flyteplugin to properly generate the CRD for SageMaker custom training job and monitor the status of the submitted job.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

See flyteorg/flyte#451

Tracking Issue

https://github.com/lyft/flyte/issues/

Follow-up issue

NA
OR
https://github.com/lyft/flyte/issues/

@codecov-commenter
Copy link

codecov-commenter commented Aug 14, 2020

Codecov Report

Merging #113 into master will increase coverage by 3.12%.
The diff coverage is 71.20%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #113      +/-   ##
==========================================
+ Coverage   58.75%   61.87%   +3.12%     
==========================================
  Files          99      104       +5     
  Lines        5957     6434     +477     
==========================================
+ Hits         3500     3981     +481     
+ Misses       2106     2084      -22     
- Partials      351      369      +18     
Impacted Files Coverage Δ
go/tasks/plugins/k8s/sagemaker/config/config.go 0.00% <0.00%> (ø)
go/tasks/plugins/k8s/sagemaker/outputs.go 20.51% <20.51%> (ø)
...sks/plugins/k8s/sagemaker/hyperparameter_tuning.go 61.23% <61.23%> (ø)
go/tasks/plugins/k8s/sagemaker/custom_training.go 68.42% <68.42%> (ø)
go/tasks/plugins/k8s/sagemaker/builtin_training.go 72.18% <72.18%> (ø)
go/tasks/plugins/k8s/sagemaker/plugin.go 73.07% <73.07%> (ø)
go/tasks/plugins/k8s/sagemaker/utils.go 72.76% <80.43%> (+10.18%) ⬆️
...o/tasks/plugins/k8s/sagemaker/plugin_test_utils.go 95.93% <96.72%> (ø)
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update af40d96...d64afce. Read the comment docs.

go/tasks/plugins/awsutils/awsutils.go Outdated Show resolved Hide resolved
go/tasks/pluginmachinery/k8s/plugin.go Show resolved Hide resolved
go/tasks/plugins/k8s/sagemaker/outputs.go Show resolved Hide resolved
go/tasks/plugins/k8s/sagemaker/sagemaker.go Outdated Show resolved Hide resolved
go/tasks/plugins/k8s/sagemaker/sagemaker.go Outdated Show resolved Hide resolved
go/tasks/plugins/k8s/sagemaker/sagemaker.go Outdated Show resolved Hide resolved
go/tasks/plugins/k8s/sagemaker/sagemaker.go Outdated Show resolved Hide resolved
go/tasks/plugins/k8s/sagemaker/sagemaker.go Outdated Show resolved Hide resolved
go/tasks/plugins/k8s/sagemaker/sagemaker.go Outdated Show resolved Hide resolved
go/tasks/plugins/k8s/sagemaker/utils.go Outdated Show resolved Hide resolved
@bnsblue bnsblue requested review from EngHabu and kumare3 September 2, 2020 02:37
@kumare3
Copy link
Contributor

kumare3 commented Sep 2, 2020

@bnsblue can we update the description?

@bnsblue bnsblue requested a review from kumare3 September 2, 2020 21:20
return trainingJob, nil
}

func (m awsSagemakerPlugin) getTaskPhaseForTrainingJob(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this different for every job type? or are the jobSTatus common?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output handling is different for different type of tasks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the job statuses are not common either

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya I realized once I went through all of them

OutputDataConfig: &commonv1.OutputDataConfig{
S3OutputPath: ToStringPtr(outputPath),
},
CheckpointConfig: nil,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is one thing we might want to model in the future. Passing in a custom checkpoint config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we will do this soon

func (m awsSagemakerPlugin) getEventInfoForJob(ctx context.Context, job k8s.Resource) (*pluginsCore.TaskInfo, error) {

var jobRegion, jobName, jobTypeInURL, sagemakerLinkName string
if m.TaskType == trainingJobTaskType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we move these parts as submethods in the specific files? co-located code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite get what you mean. Mind to elaborate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So when task type == custom that code block could be in the custom.go file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done refactoring

@kumare3
Copy link
Contributor

kumare3 commented Sep 2, 2020

@bnsblue This is awesome work! Thank you. Couple comments from my side

@bnsblue bnsblue requested a review from kumare3 September 3, 2020 02:25
@bnsblue
Copy link
Contributor Author

bnsblue commented Sep 3, 2020

Would you mind elaborating some of your comments?
Also could you please take another look? @kumare3

@kumare3
Copy link
Contributor

kumare3 commented Sep 3, 2020

Will do

@bnsblue
Copy link
Contributor Author

bnsblue commented Sep 3, 2020

@kumare3 please take another look

Copy link
Contributor

@kumare3 kumare3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me!

@bnsblue bnsblue merged commit adfa4b2 into master Sep 3, 2020
eapolinario pushed a commit that referenced this pull request Sep 6, 2023
* checking key existence in GetRole()

* (training job plugin only) inject hp; refactor image uri getting mechanism; some other smaller refactoring

* change the key and value of the flyte sm command

* get rid of a obsolete function

* add and change unit tests

* add unit tests

* lint

* refactor a dummy object generation function

* refactor a dummy object generation function

* add unit tests

* lint

* forming the runner cmd and inject it as a hyperparameter in SageMaker plugin

* fix unit tests

* fix unit tests

* add custom training plugin

* revert overriden changes

* compose the __FLYTE_SAGEMAKER_CMD__ in custom training job plugin

* add unit test

* lint

* fix getEventInfoForJob

* fix image getting

* fix unit tests

* stick with inputs.pb and modify hp injecting logic accordingly

* fix args converting logic

* use default file-based output for custom training job

* expanding PluginContext interface with necessary methods so SM plugin can access the DataStore and such

* lint error

* add unit tests

* add logic to inject env vars into hyperparameters

* fix output prefix

* fix output prefix

* remove job name from output prefix for now

* fix a unit test

* accommodating new arg and env var passsing syntax

* injecting a env var to force disable statsd for sagemaker custom training

* correcting variable name

* remove unused constant

* remove comments

* fix unit tests

* merge template.go

* pr comments

* add guarding statement wrt algorithm name for custom training plugin and built-in training plugin

* refactor file structures: splitting the code into multiple files to optimize for readability

* add documentations to a set of constants, and fix a constant's name

* split tests into multiple files

* correcting error types: make permanent failures

* refactor
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants