Graduate the API to v1 #380

Open
Tracked by #350
ahg-g opened this issue Jan 18, 2024 · 18 comments

@ahg-g
Contributor

ahg-g commented Jan 18, 2024

Graduate the JobSet API to v1. We need to keep v1alpha1 for a few more releases to make it easier for customers to migrate.

Ref: https://book.kubebuilder.io/multiversion-tutorial/api-changes
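
For context on what keeping v1alpha1 alongside v1 involves: the kubebuilder tutorial linked above describes a hub-and-spoke model, where one version acts as the hub and every other version converts to and from it. The following is a minimal, standard-library-only Go sketch of that idea, not JobSet's actual types or conversion code. All type, field, and function names (`JobSetV1Alpha1`, `JobSetV1`, `FailurePolicy`, `MaxRestarts`, `ConvertToV1`, `ConvertFromV1`) are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// JobSetV1Alpha1 is a hypothetical, heavily simplified "spoke" version
// that stays served while users migrate.
type JobSetV1Alpha1 struct {
	Name        string
	MaxRestarts int32 // hypothetical flat field in the old version
}

// FailurePolicy is a hypothetical nested struct used by the newer version.
type FailurePolicy struct {
	MaxRestarts int32
}

// JobSetV1 is a hypothetical "hub" version that other versions convert through.
type JobSetV1 struct {
	Name          string
	FailurePolicy *FailurePolicy // assumed restructuring of the flat field
}

// ConvertToV1 maps the old representation onto the hub version.
func ConvertToV1(src JobSetV1Alpha1) JobSetV1 {
	return JobSetV1{
		Name:          src.Name,
		FailurePolicy: &FailurePolicy{MaxRestarts: src.MaxRestarts},
	}
}

// ConvertFromV1 maps the hub version back onto the old representation.
// A missing FailurePolicy falls back to the zero value.
func ConvertFromV1(src JobSetV1) JobSetV1Alpha1 {
	dst := JobSetV1Alpha1{Name: src.Name}
	if src.FailurePolicy != nil {
		dst.MaxRestarts = src.FailurePolicy.MaxRestarts
	}
	return dst
}

func main() {
	old := JobSetV1Alpha1{Name: "training", MaxRestarts: 3}
	roundTripped := ConvertFromV1(ConvertToV1(old))
	// A lossless round trip means old clients keep seeing the same object.
	fmt.Println(roundTripped == old)
}
```

A lossless round trip of this kind is what allows existing v1alpha1 manifests to keep working during the migration window.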

@danielvegamyhre
Contributor

@ahg-g and I were discussing skipping v1beta1 and graduating directly to v1, since several companies have been using it for their production training workloads for a few months now, so it is effectively GA already.

@kannon92 @vsoch what do you think?

@vsoch
Contributor

vsoch commented Feb 7, 2024

It depends on whether you think it's really free of errors and potential issues. I don't see any harm in doing v1beta1 and then having that wiggle room, but it's up to you!

@kannon92
Contributor

kannon92 commented Feb 7, 2024

I think the main thing about v1 promotion is that it means we should not break any existing user. We have been pretty careful about not breaking the API anymore, so I think it's fine to promote to GA.

Could you ask those companies (or even you!) to create an adopters page? It'd be nice to convey that people are actually using this project for something.

I created #398 for the adopters page actually

@ahg-g
Contributor Author

ahg-g commented Feb 9, 2024

@vsoch it is not about being free of errors; this is more about what commitment we are making to API stability. Since multiple users already depend on it, this is practically becoming GA because everything we do moving forward must be backward compatible, so we might as well make that commitment official in the API.

> Could you ask those companies (or even you!) to create an adopters page? It'd be nice to convey that people are actually using this project for something.

we can mention that Google Cloud is using it (we are not at liberty to list the customers); @vsoch if you feel comfortable perhaps we can list Lawrence Livermore National Laboratory?

ahg-g changed the title from "Graduate the API to v1beta1" to "Graduate the API to v1" on Feb 9, 2024
@vsoch
Contributor

vsoch commented Feb 9, 2024

I will ask! To be clear - "using it" meaning for development and prototyping or in production? We do not have a production Kubernetes cluster. That's what we are working towards.

@danielvegamyhre
Contributor

/assign

@kannon92
Contributor

Starting this: #518

@kannon92
Contributor

kannon92 commented May 7, 2024

@ahg-g @danielvegamyhre from the KubeFlow discussions, do we want to table this?

@ahg-g
Contributor Author

ahg-g commented May 8, 2024

Yes, I think so.

@andreyvelich

Based on our recent conversations, let's have a chat on the next Batch WG and Kubeflow Training WG calls to define action items. It would be nice to identify the list of pending APIs for JobSet V1 for various ML training/fine-tuning use cases (e.g. PodGroups, Elastic Jobs, Stateful Indexed Jobs, etc.).
We can discuss short- and long-term goals, and gradually start working on them.
cc @tenzen-y @johnugeorge @terrytangyuan

@danielvegamyhre
Contributor

@ahg-g @andreyvelich can you clarify what aspect of the recent discussions led you to want to pause graduation to v1? I've been having to spend a lot of time on an internal project lately and missed some of the latest conversations I think.

@andreyvelich

andreyvelich commented May 8, 2024

> @ahg-g @andreyvelich can you clarify what aspect of the recent discussions led you to want to pause graduation to v1? I've been having to spend a lot of time on an internal project lately and missed some of the latest conversations I think.

A few examples:

  1. Support elastic policy to create HPA for PyTorch elastic.
  2. JobSet doesn't have Restarting and Running conditions.
  3. @ahg-g proposed to introduce the JobSetTemplate that we can deploy together with Training Operator to simplify submission of Distributed PyTorch for users without understanding how to configure environment variables in JobSet.

As we discussed in this thread: https://docs.google.com/document/d/1C2ev7yRbnMTlQWbQCfX7BCCHcLIAGW2MP9f7YeKl2Ck/edit?disco=AAABKU-uQyA
since we want to make JobSet APIs stable in V1, it would be nice to prototype production use-cases for Jax, MPI, or PyTorch to understand if we need to make any changes to the JobSet APIs.

I am happy to discuss it on the next Batch WG call tomorrow if we have time for it. cc @bigsur0

@ahg-g
Contributor Author

ahg-g commented May 10, 2024

Thanks @andreyvelich. Who can help create tracking issues for the first two that describe the requirements in more detail? @tenzen-y?

@tenzen-y
Member

> Thanks @andreyvelich, who can help create tracking issues for the first two that describe the requirements in more details? @tenzen-y ?

Thank you for mentioning me. Yes, I can help create some issues.
Let me summarize the context and requirements.

@tenzen-y
Member

tenzen-y commented May 14, 2024

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Aug 12, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Sep 11, 2024
@ahg-g
Contributor Author

ahg-g commented Sep 12, 2024

/remove-lifecycle rotten

k8s-ci-robot removed the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) on Sep 12, 2024
8 participants