JobSetTemplate API #573

Open
ahg-g opened this issue May 15, 2024 · 10 comments
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@ahg-g
Contributor

ahg-g commented May 15, 2024

What would you like to be added:
A JobSetTemplate API similar to PodTemplate.

Why is this needed:
APIs built on top of JobSet need to reference a JobSet spec. The common approach is to embed the JobSet spec inside the higher-level API, which makes it hard to validate; the other approach is to reference a template.
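
To make the comparison with PodTemplate concrete: the core/v1 PodTemplate object has no `spec` field; it carries a single top-level `template` field that holds the Pod's metadata and spec. A JobSetTemplate modeled the same way might look like the sketch below. The `JobSetTemplate` kind, its top-level `template` field, and its placement under `jobset.x-k8s.io/v1alpha2` are assumptions for illustration only; none of them exist in JobSet today.

```yaml
# Hypothetical sketch: JobSetTemplate is NOT an existing JobSet API.
# The shape mirrors core/v1 PodTemplate, which uses a top-level `template`
# field (object metadata plus spec) rather than a `spec` field.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSetTemplate
metadata:
  name: example-template        # placeholder name
template:
  metadata:
    labels:
      app: example              # labels copied onto JobSets stamped from the template
  spec:
    # Everything under `spec` is an ordinary JobSet spec.
    replicatedJobs:
      - name: workers
        replicas: 1
        template:
          spec:
            parallelism: 2
            completions: 2
            backoffLimit: 0
            template:
              spec:
                containers:
                  - name: worker
                    image: busybox
                    command: ["sleep", "100"]
```

With a shape like this, a higher-level API could reference the template by name instead of embedding the whole JobSet spec, and the template would be validated once, as its own object.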

@ahg-g
Contributor Author

ahg-g commented May 15, 2024

/feature

@tenzen-y
Member

/kind feature

@k8s-ci-robot added the kind/feature label on May 15, 2024

@googs1025
Member

Hello, I want to share a couple of simple ideas, though I am not sure they are what we need.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSetTemplate
metadata:
  name: my-jobset-template
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
    - name: workers
      replicas: 1
      template:
        spec:
          backoffLimit: 0
          completions: 2
          parallelism: 2
          template:
            spec:
              containers:
                - name: worker
                  image: bash:latest
                  command:
                    - bash
                    - -xc
                    - |
                      sleep 1000
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: my-jobset
spec:
  templateRef:
    name: my-jobset-template

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: paralleljobs
spec:
  replicatedJobs:
    - name: workers
      templateRef: my-jobset-template
    - name: driver
      templateRef: my-jobset-template
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSetTemplate
metadata:
  name: my-jobset-template
spec:
  replicas: 3
  template:
    spec:
      parallelism: 1
      completions: 1
      backoffLimit: 0
      template:
        spec:
          containers:
            - name: sleep
              image: busybox
              command:
                - sleep
              args:
                - 100s

If this approach is correct, we may need another custom resource and a controller to manage it.
Apologies if I have misunderstood the request.

@googs1025
Member

googs1025 commented Jul 2, 2024

@ahg-g @danielvegamyhre @kannon92 Could you please check whether my understanding is correct? If so, I will pick this up when I have time and write a KEP design document.

@kannon92
Contributor

kannon92 commented Jul 25, 2024

I’d look at how CronJob uses JobTemplates, or even how JobSet uses a JobTemplate.

A user should be able to create a JobSet without using templates.

TrainJob could specify a template, and that template would be used to create a JobSet. I think that’s the flow.

Generally, templates are used when someone wants to compose the object.
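
For reference, this is the embedding pattern CronJob uses: the user does not create the Job directly. Instead, `spec.jobTemplate` carries a Job's metadata and spec, and the CronJob controller creates a Job from it on each scheduled run. The manifest below is a minimal sketch using the batch/v1 CronJob fields; the name, schedule, and image are placeholders.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cronjob          # placeholder name
spec:
  schedule: "*/5 * * * *"        # placeholder: every five minutes
  jobTemplate:
    # Same shape as a Job: optional metadata plus a Job spec.
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: sleep
              image: busybox
              command: ["sleep", "100"]
```

JobSet follows the same pattern one level down: each entry in `replicatedJobs` embeds a Job template under `template`. A JobSetTemplate would let a higher-level API such as TrainJob do the same with a whole JobSet.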

@googs1025
Member

> I’d look at how CronJob uses JobTemplates, or even how JobSet uses a JobTemplate.
>
> A user should be able to create a JobSet without using templates.
>
> TrainJob could specify a template, and that template would be used to create a JobSet. I think that’s the flow.
>
> Generally, templates are used when someone wants to compose the object.

Perhaps we can create a JobSetTemplateController to manage JobSetTemplate objects. A JobSetTemplate would hold only template metadata, and JobSet objects could reference it. I'm not sure whether this is a good design, though.

@andreyvelich

According to this proposal: kubeflow/training-operator#2171, we are planning to create TrainingRuntime and ClusterTrainingRuntime to represent blueprints for various ML training or HPC configurations.
For LLM runtimes, we will support a list of different templates for fine-tuning open-source foundation models.

Since we use the JobSet API directly in TrainingRuntime, I am wondering whether we still need JobSetTemplates.

@tenzen-y
Member

> According to this proposal: kubeflow/training-operator#2171, we are planning to create TrainingRuntime and ClusterTrainingRuntime to represent blueprints for various ML training or HPC configurations. For LLM runtimes, we will support a list of different templates for fine-tuning open-source foundation models.
>
> Since we use the JobSet API directly in TrainingRuntime, I am wondering whether we still need JobSetTemplates.

As I understand it, @ahg-g mentioned that he wants to try supporting this JobSetTemplate feature regardless of Training Operator v2.

@danielvegamyhre
Contributor

> > According to this proposal: kubeflow/training-operator#2171, we are planning to create TrainingRuntime and ClusterTrainingRuntime to represent blueprints for various ML training or HPC configurations. For LLM runtimes, we will support a list of different templates for fine-tuning open-source foundation models.
> > Since we use the JobSet API directly in TrainingRuntime, I am wondering whether we still need JobSetTemplates.
>
> As I understand it, @ahg-g mentioned that he wants to try supporting this JobSetTemplate feature regardless of Training Operator v2.

Yes, we have another use case where JobSetTemplate would be useful. I can't elaborate much further right now since it isn't public yet, but there are definitely other use cases :)

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Oct 23, 2024