
feat(framework): Support JAX #1619

Closed · gaocegege opened this issue Jun 22, 2022 · 31 comments · Fixed by #2194

@gaocegege (Member)

JAX has become extremely popular these days. Users may expect to run JAX distributed training jobs on Kubernetes with the help of the training-operator.

JAX uses a “multi-controller” programming model where each JAX Python process runs independently, sometimes referred to as a Single Program, Multiple Data (SPMD) model. I think it would not be hard to support from the operator's perspective.
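For context, a minimal hedged sketch of what the multi-controller model looks like from user code (note: jax.distributed.initialize() can auto-detect its arguments only on some platforms, such as Cloud TPU; elsewhere they must be passed explicitly, as in the snippet later in this thread):

```python
# Every process runs this same script; jax.distributed.initialize()
# connects the processes into one logical computation.
import jax

jax.distributed.initialize()  # auto-detects args on Cloud TPU; pass them explicitly elsewhere

print(f"process {jax.process_index()} of {jax.process_count()}, "
      f"local devices: {jax.local_device_count()}")
```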



Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

@zw0610 (Member) commented Jun 23, 2022

Are there any solid examples showing how a multi-host JAX job runs, especially the host registration part?

@kuizhiqing (Member)

Well, to launch distributed training with JAX, the jax.distributed.initialize API should be used.
One way to implement this: the training operator provides the related environment variables for each container, and the user script handles them as below (src).

import os

import jax

coordinator_address = os.environ.get('JAX_COORDINATOR_ADDRESS', None)  # coordinator host:port, provided by the operator
num_processes = int(os.environ.get('JAX_NUM_PROCESSES', 1))  # world size
process_id = int(os.environ.get('JAX_PROCESS_ID', 0))  # rank

jax.distributed.initialize(coordinator_address=coordinator_address,
                           num_processes=num_processes,
                           process_id=process_id)
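For illustration, a hedged sketch of how the operator could inject these variables into each worker container (the Service name jaxjob-0, port 1234, and image are hypothetical, not an existing training-operator convention):

```yaml
# Hypothetical pod template fragment rendered by the operator for rank 1 of 2.
containers:
- name: jax
  image: example.com/jax-train:latest     # hypothetical training image
  env:
  - name: JAX_COORDINATOR_ADDRESS         # host:port of the rank-0 pod's Service
    value: "jaxjob-0.default.svc:1234"
  - name: JAX_NUM_PROCESSES               # world size
    value: "2"
  - name: JAX_PROCESS_ID                  # this pod's rank
    value: "1"
```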

Anyway, I'm not aware of any mature practice of this in production.

@andreyvelich (Member)

/help

@google-oss-prow (bot)

@andreyvelich:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

> /help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y (Member)

/lifecycle frozen

@andreyvelich (Member)

/assign @Davidnet

For anyone interested in JAX support for Training Operator, please join our AutoML and Training WG call on November 29th, 5:00 pm UTC:
https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#heading=h.yvypq06ot57p

We are going to discuss how we can move forward with JAX support.

@tenzen-y (Member)

cc: @mimowo

Michal may be interested in this JAX integration.

@yzhao-2023 commented Dec 22, 2023

We'd be interested in supporting JAX as well, and would be interested in contributing developer hours (with mentoring from a qualified Kubeflow maintainer).
cc @sxwl-donggang

@johnugeorge (Member)

Thanks for the interest. Happy to help, @yzhao-2023.

@andreyvelich (Member) commented Dec 22, 2023

It would be great if you could help us with the JAX implementation in Training Operator.

If you are available, @yzhao-2023, please attend one of our upcoming AutoML and Training WG calls.
We can guide you through the Training Operator implementation and how we can add JAX support.

@kuizhiqing (Member)

> It would be great if you could help us with the JAX implementation in Training Operator.
>
> If you are available, @yzhao-2023, please attend one of our upcoming AutoML and Training WG calls. We can guide you through the Training Operator implementation and how we can add JAX support.

If you prefer 11:00 am UTC, I'd like to be there too.

@jdcfd (Contributor) commented Feb 11, 2024

Links from the 2023-11-29 meeting notes:

In the meeting they mentioned that the documentation for the Training Operator was a bit outdated. Has it been updated?
EDIT: I see that the .md file was updated 2 weeks ago, so I suppose it is up to date.

@jdcfd (Contributor) commented Feb 11, 2024

Also, if you end up doing a Training Operator deep-dive session, it would be good to share it here so that anyone wanting to contribute can join or watch a recording later.

@octonawish-akcodes

Hi @andreyvelich, I'm interested in this issue for the upcoming GSoC term. Is there a roadmap doc available, and could you provide more context or resources I can look at to better understand it?

@andreyvelich (Member)

Hi @jdcfd @octonawish-akcodes, thank you for your interest in working on JAX support in Training Operator!

If you are available, please attend one of the upcoming AutoML and Training WG calls: https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#heading=h.yvypq06ot57p
We will discuss details on how we can add support for JaxJobs.

@jdcfd (Contributor) commented Feb 22, 2024

Sorry I missed it; Wednesdays are very tough for me. I did watch the last recording, and I will watch today's meeting later this week. Judging by the meeting notes, it seems the JaxJobs topic wasn't touched on this time.

@andreyvelich (Member)

Hi @jdcfd, we briefly discussed JAX support in the recent call: https://youtu.be/rXBCliRugNk
We are going to talk more about JAX in the next Training WG community meetings.
/area gsoc

@sandipanpanda (Member)

I am interested in collaborating on a design proposal for integrating JAX into Training Operator.

@ahg-g commented Mar 28, 2024

Why not just use the Job or JobSet API? What is missing?

@tenzen-y (Member)

> Why not just use the Job or JobSet API? What is missing?

Do you mean: why not recommend using Job or JobSet instead of the Training Operator?

@ahg-g commented Mar 28, 2024

Yes, for the JAX case, I think an Indexed Job will just work, and Job ships with any k8s cluster, so you don't need to install any extra operators. For more advanced setups, like multi-slice TPUs, JobSet works well, and it is easy to transition from Job to JobSet.
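To make that concrete, a hedged sketch of a plain Indexed Job running a 4-process JAX job (the headless Service jax-svc, port 1234, and image are hypothetical; the completion-index annotation and the `<job-name>-<index>` pod hostnames are standard Indexed Job behavior):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: jax-job
spec:
  completions: 4
  parallelism: 4
  completionMode: Indexed        # gives each pod a stable index and hostname
  template:
    spec:
      subdomain: jax-svc         # requires a matching headless Service (hypothetical name)
      restartPolicy: Never
      containers:
      - name: jax
        image: example.com/jax-train:latest   # hypothetical image
        env:
        - name: JAX_COORDINATOR_ADDRESS
          value: "jax-job-0.jax-svc:1234"     # pod for index 0 is <job-name>-0
        - name: JAX_NUM_PROCESSES
          value: "4"
        - name: JAX_PROCESS_ID                # the built-in JOB_COMPLETION_INDEX env var would also work
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
```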

@andreyvelich (Member) commented Mar 28, 2024

@ahg-g As we discussed recently, we should understand whether JobSet can cover all use cases for JAX and other ML frameworks. I remember that @tenzen-y was previously working on adding SuccessPolicy support to the Job API, so that we can re-use Job in the Training Operator.

Also, we should understand the following:

  • Does JAX support any specific distributed training capabilities that would require orchestrating additional Kubernetes resources, as the MPI-Operator does?
  • Do we need specific resource statuses that are exclusive to JaxJob but not to other Jobs (e.g. PyTorchJob)?

To be clear, I am not against using JobSet as the final entity for distributed ML training on Kubernetes and deprecating framework-specific CRs, but we need to discuss the pros and cons.

Moreover, when @Jeffwan and @zw0610 designed the unified Training Operator, they proposed the idea of a common CR, with a Frontend Operator to manage framework-specific resources and a Role Operator to manage common resources. At that time (2021) we didn't have JobSet yet.
In that case, the flow looks like this:

JaxJob -> JobSet -> Job -> Pod
PyTorchJob -> JobSet -> Job -> Pod

Let's collaborate in the upcoming WG Batch and Kubeflow WG Training community calls to discuss our next steps.

cc @bigsur0

@tenzen-y (Member)

> Yes, for the JAX case, I think an Indexed Job will just work, and Job ships with any k8s cluster, so you don't need to install any extra operators. For more advanced setups, like multi-slice TPUs, JobSet works well, and it is easy to transition from Job to JobSet.

@ahg-g I think that kubeflow's JaxJob has some advantages: 1. it can use the same CRD shape as other frameworks like PyTorchJob and TFJob; 2. there is no need to set up env vars and Services by hand; 3. it is possible to use the higher-level Python SDK.

Indeed, some developers prefer to use plain Job and JobSet for extensibility, but I believe that other developers prefer a more abstract API.

So, I believe that both approaches are valuable.

@tenzen-y (Member)

> [@andreyvelich's comment above, quoted in full]

I totally agree with @andreyvelich.

@ahg-g commented Mar 28, 2024

Thanks @tenzen-y and @andreyvelich.

My worry is that adding another API on top means another operator, and so more sources of errors and additional operational overhead.

The points related to automating the configuration (env vars, ConfigMaps, etc.) are valid, and we are thinking about solutions for this in JobSet. One idea is JobSet "extensions": imagine that the JobSet API includes an opaque class parameter of type Object that represents the specific training job you want to run, and we introduce hooks in the JobSet operator and webhook to actuate on it.

kind: JobSet
spec:
  class:
    kind: MPI
    ...

The MPI extension within JobSet would know how to parse this class and populate the JobSet with all things MPI. This is just a rough idea; the devil is in the details, as usual :)

@andreyvelich (Member)

> that represents the specific training job you want to run, and we introduce hooks in JobSet operator and webhook to actuate on it.

@ahg-g In that case, will the mutating webhook be responsible for orchestrating additional Kubernetes resources for the Job (e.g. ConfigMap, RBAC)? How are we going to handle orchestration that needs to happen during the Job runtime, for example fetching the appropriate status, or SSHing into the pod in the case of MPIJob?

@ahg-g commented Apr 3, 2024

I meant hooks in the general sense, as in places where we invoke the workload-specific function; that would be one in the webhook and one in the reconciler.

@andreyvelich (Member)

> I meant hooks in the general sense, as in places where we invoke the workload-specific function; that would be one in the webhook and one in the reconciler.

In that case, users would have to take the JobSet controller and rebuild the reconciler image to support such execution, right?
Or would we contribute such extensions upstream?

@andreyvelich (Member)

/assign @sandipanpanda

@andreyvelich (Member)

Thanks to @sandipanpanda for implementing JAXJob support in Training Operator V1: https://www.kubeflow.org/docs/components/training/user-guides/jax/ 🎉

We are planning to implement JAX as a training runtime in Training Operator V2 as well.
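For reference, a minimal JAXJob along the lines of that user guide looks roughly like this (the image here is a placeholder; see the linked docs for a tested example):

```yaml
apiVersion: kubeflow.org/v1
kind: JAXJob
metadata:
  name: jaxjob-simple
spec:
  jaxReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: jax
            image: example.com/jax-train:latest   # placeholder image
```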
