Feature/support pytorchjob set queue of volcano #1415

Merged
Changes from 2 commits
8 changes: 7 additions & 1 deletion pkg/controller.v1/pytorch/pytorchjob_controller.go
@@ -53,6 +53,7 @@ import (
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/log"
"sigs.k8s.io/controller-runtime/pkg/manager"
+ podgroupv1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
volcanoclient "volcano.sh/apis/pkg/client/clientset/versioned"
)

@@ -155,13 +156,18 @@ func (r *PyTorchJobReconciler) Reconcile(ctx context.Context, req ctrl.Request)
// Set default priorities to pytorch job
r.Scheme.Default(pytorchjob)

+ // parse volcano Queue from pytorchjob Annotation
Member

What about other jobs?

+ schedulingPolicy := &commonv1.SchedulingPolicy{
Member

Since the PyTorchJob spec embeds RunPolicy, can we get the scheduling policy directly from pytorchjob.Spec.RunPolicy.SchedulingPolicy?
@qiankunli

Member

We should use pytorchjob.Spec.RunPolicy as the argument to reconcile the jobs

Contributor Author

@Jeffwan

Right now SchedulingPolicy is always nil in pytorch-operator. If SchedulingPolicy is set, it is fine to use pytorchjob.Spec.RunPolicy.SchedulingPolicy directly.

// github.com/kubeflow/tf-operator/pkg/controller.v1/pytorch/pytorchjob_controller.go
runPolicy := &commonv1.RunPolicy{
	CleanPodPolicy:          pytorchjob.Spec.RunPolicy.CleanPodPolicy,
	TTLSecondsAfterFinished: pytorchjob.Spec.RunPolicy.TTLSecondsAfterFinished,
	ActiveDeadlineSeconds:   pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds,
	BackoffLimit:            pytorchjob.Spec.RunPolicy.BackoffLimit,
	SchedulingPolicy:        nil,
}
// Use common to reconcile the job related pod and service
err = r.ReconcileJobs(pytorchjob, pytorchjob.Spec.PyTorchReplicaSpecs, pytorchjob.Status, runPolicy)

Contributor Author

@Jeffwan I updated the PR:

	runPolicy := &commonv1.RunPolicy{
		CleanPodPolicy:          pytorchjob.Spec.RunPolicy.CleanPodPolicy,
		TTLSecondsAfterFinished: pytorchjob.Spec.RunPolicy.TTLSecondsAfterFinished,
		ActiveDeadlineSeconds:   pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds,
		BackoffLimit:            pytorchjob.Spec.RunPolicy.BackoffLimit,
		SchedulingPolicy:        pytorchjob.Spec.RunPolicy.SchedulingPolicy,
	}

Member

@qiankunli

  1. Can you help update this for the MXNet job as well?
  2. Actually, since pytorchjob.Spec.RunPolicy is already a commonv1.RunPolicy, we can pass pytorchjob.Spec.RunPolicy instead of constructing a new one. See the xgboost example:

https://github.com/kubeflow/tf-operator/blob/acba15e644b4c4d4fe6b68664407e4ea588d4458/pkg/controller.v1/xgboost/xgboostjob_controller.go#L176
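
The difference between rebuilding the policy field by field and passing the spec's own policy can be sketched with stand-in types. This is a minimal illustration, not the operator's code: `SchedulingPolicy`, `RunPolicy`, and `JobSpec` below are simplified, hypothetical versions of the real API types, and `rebuildRunPolicy`, `directRunPolicy`, and `queueOf` are names invented for the example.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the kubeflow common API types,
// reduced to the fields discussed in this thread.
type SchedulingPolicy struct {
	Queue string
}

type RunPolicy struct {
	BackoffLimit     *int32
	SchedulingPolicy *SchedulingPolicy
}

type JobSpec struct {
	RunPolicy RunPolicy
}

// rebuildRunPolicy mirrors the hand-built copy: SchedulingPolicy is set to
// nil, so whatever the user put in the spec is dropped.
func rebuildRunPolicy(spec *JobSpec) *RunPolicy {
	return &RunPolicy{
		BackoffLimit:     spec.RunPolicy.BackoffLimit,
		SchedulingPolicy: nil,
	}
}

// directRunPolicy returns a pointer to the spec's own policy, so every field
// the user set is forwarded without a field-by-field copy.
func directRunPolicy(spec *JobSpec) *RunPolicy {
	return &spec.RunPolicy
}

// queueOf returns the queue name, or "" when no scheduling policy is set.
func queueOf(rp *RunPolicy) string {
	if rp == nil || rp.SchedulingPolicy == nil {
		return ""
	}
	return rp.SchedulingPolicy.Queue
}

func main() {
	spec := &JobSpec{RunPolicy: RunPolicy{
		SchedulingPolicy: &SchedulingPolicy{Queue: "training"},
	}}
	fmt.Printf("rebuilt: %q\n", queueOf(rebuildRunPolicy(spec)))
	fmt.Printf("direct:  %q\n", queueOf(directRunPolicy(spec)))
}
```

In this sketch the rebuilt policy reports an empty queue while the direct one reports "training". Passing the pointer also means fields added to the policy type later are forwarded without touching the controller, at the cost of sharing the same struct with the job object.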

Member

"Can you help update this for MXNet job as well?"

Should we do that in another PR?

+ 	Queue: pytorchjob.Annotations[podgroupv1beta1.QueueNameAnnotationKey],
+ }

// Construct RunPolicy based on PyTorchJob.Spec
  runPolicy := &commonv1.RunPolicy{
  	CleanPodPolicy:          pytorchjob.Spec.RunPolicy.CleanPodPolicy,
  	TTLSecondsAfterFinished: pytorchjob.Spec.RunPolicy.TTLSecondsAfterFinished,
  	ActiveDeadlineSeconds:   pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds,
  	BackoffLimit:            pytorchjob.Spec.RunPolicy.BackoffLimit,
- 	SchedulingPolicy:        nil,
+ 	SchedulingPolicy:        schedulingPolicy,
  }

// Use common to reconcile the job related pod and service
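
The annotation lookup in this hunk can be tried in isolation. The sketch below is an illustration under stated assumptions: `SchedulingPolicy` is a hypothetical stand-in for `commonv1.SchedulingPolicy`, the value of `queueNameAnnotationKey` is assumed rather than taken from the Volcano package, and `schedulingPolicyFromAnnotations` is a helper name invented here.

```go
package main

import "fmt"

// queueNameAnnotationKey is an ASSUMED value for
// podgroupv1beta1.QueueNameAnnotationKey; verify it against the Volcano
// API package before relying on it.
const queueNameAnnotationKey = "scheduling.volcano.sh/queue-name"

// SchedulingPolicy is a hypothetical stand-in for commonv1.SchedulingPolicy.
type SchedulingPolicy struct {
	Queue string
}

// schedulingPolicyFromAnnotations builds a policy from job annotations.
// Indexing a nil or empty map is safe in Go and yields "", so a job without
// the annotation gets an empty Queue rather than a panic.
func schedulingPolicyFromAnnotations(annotations map[string]string) *SchedulingPolicy {
	return &SchedulingPolicy{Queue: annotations[queueNameAnnotationKey]}
}

func main() {
	withQueue := map[string]string{queueNameAnnotationKey: "gpu-queue"}
	fmt.Printf("%q\n", schedulingPolicyFromAnnotations(withQueue).Queue)
	fmt.Printf("%q\n", schedulingPolicyFromAnnotations(nil).Queue)
}
```

Note that a policy built this way is never nil, even when the annotation is absent, so downstream code sees an empty queue name instead of a missing policy.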