feat(pytorch): Support elastic training #1453

Merged
merged 10 commits into from
Nov 26, 2021

Conversation

gaocegege
Member

@gaocegege gaocegege commented Oct 28, 2021

PoC for kubeflow/community#522

This PR:

  • Updates kubeflow/common to 0.4.1, which adds LabelSelector to the job status so that the scale subresource can be supported
  • Implements ElasticPolicy in PyTorchJob
  • Adds two elastic examples
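As a rough illustration of what an elastic policy has to express, the sketch below maps a minimum and maximum replica count onto the `MIN:MAX` range accepted by torchrun's `--nnodes` flag. The `ElasticPolicy` struct here is a hypothetical, trimmed-down stand-in, and its field names and the `nnodes` helper are assumptions made for this example rather than the definitions merged in this PR.

```go
package main

import "fmt"

// ElasticPolicy is a hypothetical, trimmed-down stand-in for the policy this
// PR adds to PyTorchJob; the field names are assumptions for illustration.
type ElasticPolicy struct {
	MinReplicas *int32 // lower bound on workers; nil falls back to the replica count
	MaxReplicas *int32 // upper bound on workers; nil falls back to the replica count
}

// nnodes renders the worker range in the MIN:MAX form accepted by
// torchrun's --nnodes flag, falling back to the fixed replica count.
func nnodes(p ElasticPolicy, replicas int32) string {
	lo, hi := replicas, replicas
	if p.MinReplicas != nil {
		lo = *p.MinReplicas
	}
	if p.MaxReplicas != nil {
		hi = *p.MaxReplicas
	}
	return fmt.Sprintf("%d:%d", lo, hi)
}

// int32Ptr is a small helper for building optional fields.
func int32Ptr(v int32) *int32 { return &v }

func main() {
	p := ElasticPolicy{MinReplicas: int32Ptr(1), MaxReplicas: int32Ptr(3)}
	fmt.Println("--nnodes=" + nnodes(p, 2))
}
```

Pointer fields are used so that an unset bound can be told apart from an explicit zero, which is a common convention for optional fields in Kubernetes API types.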

This PR does not:

@aws-kf-ci-bot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

coveralls commented Nov 5, 2021

Pull Request Test Coverage Report for Build 1505952988

  • 220 of 568 (38.73%) changed or added relevant lines in 12 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+2.6%) to 26.627%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/controller.v1/xgboost/xgboostjob_controller.go | 0 | 1 | 0.0% |
| pkg/controller.v1/pytorch/master.go | 21 | 23 | 91.3% |
| pkg/apis/pytorch/v1/defaults.go | 16 | 21 | 76.19% |
| pkg/controller.v1/pytorch/pytorch.go | 27 | 33 | 81.82% |
| pkg/apis/mxnet/v1/openapi_generated.go | 0 | 8 | 0.0% |
| pkg/apis/tensorflow/v1/openapi_generated.go | 0 | 8 | 0.0% |
| pkg/apis/xgboost/v1/openapi_generated.go | 0 | 8 | 0.0% |
| pkg/controller.v1/pytorch/elastic.go | 95 | 117 | 81.2% |
| pkg/controller.v1/pytorch/pytorchjob_controller.go | 53 | 77 | 68.83% |
| pkg/controller.v1/pytorch/hpa.go | 8 | 56 | 14.29% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/apis/pytorch/validation/validation.go | 1 | 91.67% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 1493027296 | 2.6% |
| Covered Lines | 1600 |
| Relevant Lines | 6009 |

💛 - Coveralls

@gaocegege gaocegege force-pushed the pytorchelastic branch 3 times, most recently from 2bfbb3d to 46eeff9 Compare November 12, 2021 07:01
@gaocegege
Member Author

46eeff9 implemented the feature.

Overall coverage increased (+7.1%) to 15.252%, and PyTorch-related test coverage increased from 0% to 80%.

PTAL

@gaocegege gaocegege marked this pull request as ready for review November 12, 2021 07:04
go.mod (outdated, resolved)
Signed-off-by: Ce Gao <[email protected]>
@gaocegege
Member Author

The PR is ready to review. PTAL

/assign @kubeflow/wg-training-leads

@gaocegege
Member Author

/assign @zw0610

@zw0610
Member

zw0610 commented Nov 24, 2021

I suppose it's ready to review, right?

@gaocegege
Member Author

Yeah, it is ready to review.

Member

@terrytangyuan terrytangyuan left a comment

/cc @zw0610

@google-oss-prow google-oss-prow bot requested a review from zw0610 November 25, 2021 06:39
@gaocegege
Member Author

Close #1483

---

apiVersion: v1
kind: Pod
Member

  1. Why does the example use only 1 etcd pod instead of 3?
  2. Is this etcd instance deployed per (elastic) PyTorch job, or will all elastic PyTorch jobs share one etcd instance?

Member Author

You can share one. And you do not need one at all if you are using c10d.
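The difference between the two rendezvous backends can be sketched with torchrun's `--rdzv_backend`, `--rdzv_endpoint`, and `--rdzv_id` flags. This is a minimal sketch under stated assumptions: the `rendezvousArgs` helper, the `etcd-service:2379` address, and the `<job>-worker-0:29400` host name are hypothetical and are not taken from the PR. Jobs that share one etcd instance are kept apart by their rendezvous ID, while c10d hosts the rendezvous on one of the job's own nodes.

```go
package main

import (
	"fmt"
	"strings"
)

// rendezvousArgs builds torchrun rendezvous flags for one job.
// The etcd endpoint and the "<job>-worker-0" host name are hypothetical.
func rendezvousArgs(backend, jobName, etcdEndpoint string) []string {
	endpoint := etcdEndpoint
	if backend == "c10d" {
		// c10d hosts the rendezvous store on one of the job's own nodes,
		// so no external etcd is needed.
		endpoint = jobName + "-worker-0:29400"
	}
	return []string{
		"--rdzv_backend=" + backend,
		"--rdzv_endpoint=" + endpoint,
		// A per-job rendezvous ID keeps jobs apart on a shared backend.
		"--rdzv_id=" + jobName,
	}
}

func main() {
	shared := "etcd-service:2379" // assumed address of a shared etcd Service
	fmt.Println(strings.Join(rendezvousArgs("etcd", "job-a", shared), " "))
	fmt.Println(strings.Join(rendezvousArgs("etcd", "job-b", shared), " "))
	fmt.Println(strings.Join(rendezvousArgs("c10d", "job-c", ""), " "))
}
```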

examples/pytorch/elastic/imagenet/Dockerfile (outdated, resolved)
examples/pytorch/elastic/imagenet/Dockerfile (resolved)
examples/pytorch/elastic/imagenet/imagenet.yaml (outdated, resolved)
@@ -6916,6 +6916,53 @@ spec:
description: The number of pods which reached phase Failed.
format: int32
type: integer
labelSelector:
Member

While I do understand that this change will inevitably be introduced once we update the kubeflow/common API definitions, maybe we should let the other APIs/controllers pick up these changes in another PR.

Member Author

If the other operators do not support the scale subresource, then it is not needed. Do you mean we should support the subresource for the other operators?

Member

After the update to kubeflow/common, every time we run make generate in this repo, it will always add labelSelector to the other APIs. I would suggest ignoring such changes to APIs other than PyTorchJob in this PR.

Member Author

It is hard to ignore, since the code generator reads the API definition and generates the corresponding code automatically. The labelSelector field is a pointer, so it is optional, and it does not affect existing CRDs.
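The point about the pointer field can be illustrated with plain `encoding/json`. The `JobStatus` and `LabelSelector` types below are hypothetical, trimmed-down stand-ins (not the kubeflow/common definitions), and the `omitempty` tag is an assumption about how such a field is typically declared. When the pointer is nil, the field is simply absent from the serialized status, so objects that never set it look the same as before.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LabelSelector is a hypothetical stand-in for the Kubernetes label selector type.
type LabelSelector struct {
	MatchLabels map[string]string `json:"matchLabels,omitempty"`
}

// JobStatus is a hypothetical, trimmed-down job status with an optional selector.
type JobStatus struct {
	Active        int32          `json:"active,omitempty"`
	LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
}

// encode serializes a status to JSON and panics on the (unexpected) error path.
func encode(s JobStatus) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// A status that never sets the selector serializes without the field.
	fmt.Println(encode(JobStatus{Active: 2}))

	// A status that sets it carries the selector along.
	sel := &LabelSelector{MatchLabels: map[string]string{"job-name": "demo"}}
	fmt.Println(encode(JobStatus{Active: 2, LabelSelector: sel}))
}
```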

pkg/apis/pytorch/v1/types.go (outdated, resolved)
pkg/apis/pytorch/v1/types.go (outdated, resolved)
pkg/controller.v1/pytorch/hpa.go (outdated, resolved)
hack/update-codegen.sh (outdated, resolved)
pkg/controller.v1/pytorch/pytorch.go (outdated, resolved)
@gaocegege
Member Author

The comments are addressed, PTAL @zw0610

Member

@zw0610 zw0610 left a comment

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Nov 26, 2021
@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege, zw0610


@google-oss-prow google-oss-prow bot merged commit 3e11ac3 into kubeflow:master Nov 26, 2021
@gaocegege gaocegege deleted the pytorchelastic branch November 26, 2021 06:35