support job scheduling #898

Merged
merged 1 commit into karmada-io:master from Garrybest:pr_job on Nov 12, 2021
Conversation

Garrybest
Member

Signed-off-by: Garrybest [email protected]

What type of PR is this?
/kind feature

What this PR does / why we need it:
Add Job scheduling to the scheduler. We divide replicas by spec.parallelism, and spec.completions can be divided across member clusters, weighted by the scheduling result.
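
For illustration, a minimal Go sketch of the weighted division, assuming a simplified TargetCluster type; the helper name and the exact rounding strategy are hypothetical and may differ from the repo's util.DivideReplicasByTargetCluster:

package util

// TargetCluster is a simplified stand-in for workv1alpha2.TargetCluster.
type TargetCluster struct {
    Name     string
    Replicas int32
}

// divideCompletionsByTargetCluster is an illustrative sketch, not the actual
// implementation: split .spec.completions across clusters in proportion to the
// replicas each cluster was scheduled, then hand out the remainder left by
// integer division one unit at a time.
func divideCompletionsByTargetCluster(clusters []TargetCluster, completions int32) []TargetCluster {
    var total int32
    for _, c := range clusters {
        total += c.Replicas
    }
    result := make([]TargetCluster, len(clusters))
    for i, c := range clusters {
        result[i].Name = c.Name
    }
    if len(clusters) == 0 || total == 0 {
        return result
    }
    var assigned int32
    for i, c := range clusters {
        share := completions * c.Replicas / total
        result[i].Replicas = share
        assigned += share
    }
    for i := 0; assigned < completions; i = (i + 1) % len(result) {
        result[i].Replicas++
        assigned++
    }
    return result
}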

Which issue(s) this PR fixes:
Fixes #893

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

support job scheduling

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 1, 2021
@karmada-bot karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 1, 2021
@pigletfly
Contributor

will #899 cover this case?

@Garrybest
Member Author

will #899 cover this case?

I'm not sure. Do you have any ideas? @RainbowMango

@RainbowMango
Member

Not sure, let me take a look.
I didn't realize that Jobs should be scheduled.

@mrlihanbo

I wonder how to aggregate the status of a Job. We previously collected job status in the detector: https://github.com/karmada-io/karmada/blob/master/pkg/detector/aggregate_status.go#L193. We set the JobComplete condition when all Jobs running in member clusters succeed. But with job scheduling, the meaning here needs to adapt.
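
For context, a rough sketch of the rule described above, assuming the detector has collected one batchv1.JobStatus per member cluster; the helper is hypothetical, not the detector's actual code:

package detector

import (
    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
)

// allMemberJobsComplete reports whether every member-cluster Job status carries
// a JobComplete condition; only then would the federated Job be marked complete.
// Hypothetical illustration of the current behavior, not the detector code.
func allMemberJobsComplete(statuses []batchv1.JobStatus) bool {
    if len(statuses) == 0 {
        return false
    }
    for _, s := range statuses {
        complete := false
        for _, c := range s.Conditions {
            if c.Type == batchv1.JobComplete && c.Status == corev1.ConditionTrue {
                complete = true
                break
            }
        }
        if !complete {
            return false
        }
    }
    return true
}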

@Garrybest
Member Author

But with job scheduling, the meaning here needs to adapt.

Hi @mrlihanbo, this is an interesting question. Do we need more adaptation? I thought the aggregation doesn't need any further changes. When we divide replicas, we still mark the job as completed after all Jobs in member clusters succeed, right? Could you please give more details?

@Garrybest
Member Author

Leave a user story here.

@Garrybest
Member Author

Well, I think the startTime and completionTime of the Job status should be adapted when job replicas are divided. I tested a job that was divided into 2 member clusters; when the job finished, I didn't see the startTime and completionTime in the Karmada control plane.
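
A minimal sketch of such an adaptation (hypothetical helper, not actual Karmada code): take the earliest startTime across the member Jobs, and set completionTime only once every member Job has finished, using the latest one.

package detector

import (
    batchv1 "k8s.io/api/batch/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// aggregateJobTimes is a hypothetical sketch: pick the earliest startTime among
// the member-cluster Jobs, and report a completionTime (the latest one) only
// after every member Job has actually finished.
func aggregateJobTimes(statuses []batchv1.JobStatus) (start, completion *metav1.Time) {
    allFinished := len(statuses) > 0
    for _, s := range statuses {
        if s.StartTime != nil && (start == nil || s.StartTime.Before(start)) {
            start = s.StartTime
        }
        if s.CompletionTime == nil {
            allFinished = false
            continue
        }
        if completion == nil || completion.Before(s.CompletionTime) {
            completion = s.CompletionTime
        }
    }
    if !allFinished {
        completion = nil
    }
    return start, completion
}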

@mrlihanbo

But with job scheduling, the meaning here needs to adapt.

Hi @mrlihanbo, this is an interesting question. Do we need more adaptation? I thought the aggregation doesn't need any further changes. When we divide replicas, we still mark the job as completed after all Jobs in member clusters succeed, right? Could you please give more details?

Hi @Garrybest, for parallel Jobs with a fixed completion count, you should set .spec.completions to the number of completions needed. So the aggregation needs more adaptation: if we still mark the job as completed after all Jobs in member clusters succeed, the behavior is not the same as native Kubernetes, right?

Hi @Garrybest, my fault, it seems we can still mark the job as completed after all Jobs in member clusters succeed.

@Garrybest
Member Author

Hi @Garrybest, my fault, it seems we can still mark the job as completed after all Jobs in member clusters succeed.

Hi @mrlihanbo, I was just about to reply 😄. I found the relevant code here: if succeeded is at least completions, the job can be considered Complete.
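
Roughly, the check being referred to looks like this (an illustrative sketch, not the upstream controller code):

package detector

import batchv1 "k8s.io/api/batch/v1"

// jobSucceededEnough is an illustrative sketch of the check discussed above:
// once the number of succeeded pods reaches .spec.completions, the Job can be
// regarded as Complete.
func jobSucceededEnough(job *batchv1.Job) bool {
    return job.Spec.Completions != nil && job.Status.Succeeded >= *job.Spec.Completions
}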

@mrlihanbo

Hi @Garrybest, my fault, it seems we can still mark the job as completed after all Jobs in member clusters succeed.

Hi @mrlihanbo, I was just about to reply 😄. I found the relevant code here: if succeeded is at least completions, the job can be considered Complete.

Hi @Garrybest, I was wondering whether there is a scenario where complete = succeeded >= *job.Spec.Completions holds even though a member Job failed, and it seems such a scenario does not exist. For example, cluster A specifies completions as 10 and finally 11 pods complete, while cluster B specifies completions as 10 and finally only 9 pods complete. The Job in cluster B failed, but when we aggregate status, the number of completed pods is 20, which equals job.Spec.Completions. Still, it seems this scenario will not actually happen.

@Garrybest
Member Author

Hi @Garrybest, I was wondering whether there is a scenario where complete = succeeded >= *job.Spec.Completions holds even though a member Job failed, and it seems such a scenario does not exist.

Got it. I guess this scenario rarely happens. I'd prefer to treat this situation as JobFailed.

@RainbowMango
Member

/assign @mrlihanbo

@RainbowMango
Member

Generally looks good.

diff --git a/pkg/controllers/binding/common.go b/pkg/controllers/binding/common.go
index 422efde5..21b00e9a 100644
--- a/pkg/controllers/binding/common.go
+++ b/pkg/controllers/binding/common.go
@@ -84,12 +84,13 @@ func ensureWork(c client.Client, workload *unstructured.Unstructured, overrideMa
        var jobCompletions []workv1alpha2.TargetCluster
        var jobHasCompletions = false
        if workload.GetKind() == util.JobKind {
-               completions, ok, err := unstructured.NestedInt64(workload.Object, util.SpecField, util.CompletionsField)
+               completions, found, err := unstructured.NestedInt64(workload.Object, util.SpecField, util.CompletionsField)
                if err != nil {
                        return err
                }
-               if jobHasCompletions = ok; jobHasCompletions {
+               if found {
                        jobCompletions = util.DivideReplicasByTargetCluster(targetClusters, int32(completions))
+                       jobHasCompletions = true
                }
        }

@@ -125,6 +126,9 @@ func ensureWork(c client.Client, workload *unstructured.Unstructured, overrideMa
                                        clonedWorkload.GetKind(), clonedWorkload.GetNamespace(), clonedWorkload.GetName(), targetCluster.Name, err)
                                return err
                        }
+
+                       // A work queue Job usually leaves .spec.completions unset; in that case, we skip setting this field.
+                       // Refer to: https://kubernetes.io/docs/concepts/workloads/controllers/job/#parallel-jobs.
                        if jobHasCompletions {
                                err = applyReplicaSchedulingPolicy(clonedWorkload, int64(jobCompletions[i].Replicas), util.CompletionsField)
                                if err != nil {
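
For reference, writing the divided value into the cloned workload can be done with the unstructured helpers already used in this file. The sketch below is illustrative and may differ from the actual applyReplicaSchedulingPolicy:

package binding

import (
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// setSpecField is an illustrative sketch: write a per-cluster value (replicas or
// completions) into the workload's spec before creating the Work for that cluster.
// The real applyReplicaSchedulingPolicy may differ.
func setSpecField(workload *unstructured.Unstructured, value int64, field string) error {
    return unstructured.SetNestedField(workload.Object, value, "spec", field)
}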

Signed-off-by: Garrybest <[email protected]>
@RainbowMango
Member

@mrlihanbo Do you have any questions or comments?

@RainbowMango
Member

/lgtm
/approve

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 12, 2021
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 12, 2021
@karmada-bot karmada-bot merged commit 48c2bfb into karmada-io:master Nov 12, 2021
@Garrybest Garrybest deleted the pr_job branch November 12, 2021 02:15