Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Add manifests for Kubeflow Training V2 #2289

Merged
merged 5 commits into from
Oct 21, 2024

Conversation

andreyvelich
Copy link
Member

Fixes: #2208

I added manifests for Training Operator V2. I renamed the manager image to:

docker.io/kubeflow/training-operator-v2

That will allow users to use latest version for both V1 and V2 version of Training Operator.
In the future, we can deprecate the old version of Training Operator.

For now, I install JobSet using the release manifests in the Kustomize overlay. Let's discuss with @kubeflow/wg-manifests-leads what is the better approach long-term.

Additionally, I fixed the invalid validating webhook configuration name in our cert generator.

/assign @kubeflow/wg-training-leads
/hold for review

Signed-off-by: Andrey Velichkevich <[email protected]>
@coveralls
Copy link

coveralls commented Oct 18, 2024

Pull Request Test Coverage Report for Build 11413026138

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 11390874125: 0.0%
Covered Lines: 73
Relevant Lines: 73

💛 - Coveralls

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding manifests!

Comment on lines 50 to 51
//+kubebuilder:rbac:groups=jobset.x-k8s.io,resources=jobsets,verbs=get;list;watch;create
//+kubebuilder:rbac:groups=scheduling.x-k8s.io,resources=podgroups,verbs=get;list;watch;create
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move these markers to the each job pipeline frame work plugins?

https://github.com/kubeflow/training-operator/tree/master/pkg/runtime.v2/framework/plugins

Comment on lines 48 to 49
//+kubebuilder:rbac:groups=kubeflow.org,resources=trainingruntimes,verbs=get;list;watch
//+kubebuilder:rbac:groups=kubeflow.org,resources=clustertrainingruntimes,verbs=get;list;watch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move these markers to each job pipeline framework series?

https://github.com/kubeflow/training-operator/tree/master/pkg/runtime.v2/core

@@ -52,6 +52,8 @@ import (
const (
// EnvKubeflowNamespace is an environment variable for namespace when deployed on kubernetes
EnvKubeflowNamespace = "KUBEFLOW_NAMESPACE"

webhookConfigurationName = "validator.training-operator.kubeflow.org"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, we need to refine the webhook controller-gen marker as well:

// +kubebuilder:webhook:path=/validate-kubeflow-org-v1-pytorchjob,mutating=false,failurePolicy=fail,sideEffects=None,groups=kubeflow.org,resources=pytorchjobs,verbs=create;update,versions=v1,name=validator.pytorchjob.training-operator.kubeflow.org,admissionReviewVersions=v1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should change it, since those parameters defines the particular webhook configuration, not the name of validation webhook configuration object. For example:

name: validator.trainingruntime.kubeflow.org

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right.
This will be used in

- op: replace
  path: /metadata/name
  value: validator.training-operator-v2.kubeflow.org

@@ -44,6 +44,10 @@ import (
webhookv2 "github.com/kubeflow/training-operator/pkg/webhook.v2"
)

const (
webhookConfigurationName = "validator.training-operator-v2.kubeflow.org"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto:

// +kubebuilder:webhook:path=/validate-kubeflow-org-v2alpha1-trainjob,mutating=false,failurePolicy=fail,sideEffects=None,groups=kubeflow.org,resources=trainjobs,verbs=create;update,versions=v2alpha1,name=validator.trainjob.kubeflow.org,admissionReviewVersions=v1

metadata:
name: training-operator-v2
labels:
training.kubeflow.org/component: manager
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of these, could we use the kubernetes recommended labels?

https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/

Signed-off-by: Andrey Velichkevich <[email protected]>
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich
Copy link
Member Author

/hold cancel

@google-oss-prow google-oss-prow bot merged commit fd4d102 into kubeflow:master Oct 21, 2024
41 checks passed
google-oss-prow bot pushed a commit that referenced this pull request Oct 21, 2024
* KEP-2170: Add manifests for Kubeflow Training V2

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix invalid name for webhook config in cert

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix integration tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Move kubebuilder markers to runtime framework

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use Kubernetes recommended labels

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich andreyvelich deleted the issue-2208-v2-manifests branch October 21, 2024 19:46
saileshd1402 pushed a commit to saileshd1402/training-operator that referenced this pull request Dec 2, 2024
* KEP-2170: Add manifests for Kubeflow Training V2

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix invalid name for webhook config in cert

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix integration tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Move kubebuilder markers to runtime framework

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use Kubernetes recommended labels

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KEP-2170: Create Kustomize manifests to deploy JobSet and TrainJob controllers
3 participants