Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add JAX API #2163

Merged
merged 1 commit into from
Aug 2, 2024
Merged

Add JAX API #2163

merged 1 commit into from
Aug 2, 2024

Conversation

sandipanpanda
Copy link
Member

What this PR does / why we need it:
Integrate JAX with Kubeflow Training Operator to support JAXJob

Ref: #1619 #2145

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

@sandipanpanda
Copy link
Member Author

sandipanpanda commented Jul 10, 2024

/cc @andreyvelich @tenzen-y

@sandipanpanda
Copy link
Member Author

/area gsoc

@coveralls
Copy link

coveralls commented Jul 10, 2024

Pull Request Test Coverage Report for Build 9912856908

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 4 of 495 (0.81%) changed or added relevant lines in 13 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-1.3%) to 34.082%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/client/informers/externalversions/generic.go 0 2 0.0%
pkg/client/clientset/versioned/typed/kubeflow.org/v1/fake/fake_kubeflow.org_client.go 0 3 0.0%
pkg/client/clientset/versioned/typed/kubeflow.org/v1/kubeflow.org_client.go 0 3 0.0%
pkg/client/informers/externalversions/kubeflow.org/v1/interface.go 0 3 0.0%
pkg/apis/kubeflow.org/v1/zz_generated.defaults.go 0 10 0.0%
pkg/client/informers/externalversions/kubeflow.org/v1/jaxjob.go 0 25 0.0%
pkg/client/listers/kubeflow.org/v1/jaxjob.go 0 25 0.0%
pkg/apis/kubeflow.org/v1/jax_defaults.go 0 30 0.0%
pkg/apis/kubeflow.org/v1/zz_generated.deepcopy.go 0 65 0.0%
pkg/client/clientset/versioned/typed/kubeflow.org/v1/fake/fake_jaxjob.go 0 74 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/controller.v1/mpi/mpijob.go 1 91.06%
Totals Coverage Status
Change from base Build 9906458669: -1.3%
Covered Lines: 4383
Relevant Lines: 12860

💛 - Coveralls

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost lgtm

"k8s.io/apimachinery/pkg/runtime"
)

func addJAXJobDefaultingFuncs(scheme *runtime.Scheme) error {
Copy link
Member

@tenzen-y tenzen-y Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func addJAXJobDefaultingFuncs(scheme *runtime.Scheme) error {
func addJAXDefaultingFuncs(scheme *runtime.Scheme) error {

We usually do not add the Job suffix in most frameworks, although the XGBoost has the prefix.
Let's keep it simple.

}

// setJAXJobDefaultPort sets the default ports for jax container.
func setJAXJobDefaultPort(spec *corev1.PodSpec) {
Copy link
Member

@tenzen-y tenzen-y Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you eliminate the suffix, Job in all functions in this defaulter file?

pkg/apis/kubeflow.org/v1/jax_defaults.go Outdated Show resolved Hide resolved
Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this @sandipanpanda!

// JAXJobFrameworkName is the name of the ML Framework
JAXJobFrameworkName = "jax"
// JAXJobReplicaTypeCoordinator is the type of Coordinator of distributed JAX
JAXJobReplicaTypeCoordinator ReplicaType = "Coordinator"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed today, we don't need to have separate replica for JAX, since Pod template will be the same and Worker with index 0 will the Coordinator.
Does it sound good to you @tenzen-y ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with v1 API since we can not apply separate runPolicies only to coordinator.
In the v2 API, I believe that dedicated replica for the coordinator would be worth it since we often want to treat the coordinator runPolicies.

But, v2 API is a separate discussion.

Copy link
Member

@andreyvelich andreyvelich Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense @tenzen-y. Do you think, setting various runPolicy for JAX Coordinator and Workers makes sense even that JAX follows SPMD model: https://jax.readthedocs.io/en/latest/multi_process.html#launching-jax-processes ?
E.g. all JAX hosts should be always running.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the v1 Training API, we don't have any ability to configure runPolicy on each replicas (coordinator and worker).
So, let's discuss if we should add the Coordinator role in the v2 API.

Support JAXJob

Signed-off-by: Sandipan Panda <[email protected]>
@sandipanpanda
Copy link
Member Author

I have updated the PR. PTAL.

@sandipanpanda sandipanpanda changed the title [WIP] Add JAX API Add JAX API Jul 24, 2024
@sandipanpanda sandipanpanda changed the title Add JAX API [WIP] Add JAX API Jul 24, 2024
Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM to me. Pretty straightforward. Looking forward to the controller implementation!

@tenzen-y
Copy link
Member

tenzen-y commented Aug 1, 2024

@sandipanpanda Is this still working in progress?

@sandipanpanda
Copy link
Member Author

@sandipanpanda Is this still working in progress?

@tenzen-y I need to add the controller part, should that be in a separate PR?

@andreyvelich
Copy link
Member

@sandipanpanda Is this still working in progress?

@tenzen-y I need to add the controller part, should that be in a separate PR?

Yes, I think, we should implement the controller logic as followup PRs.
@kubeflow/wg-training-leads @shravan-achar Do you have any other comments on the JAXJob API design ?

@sandipanpanda sandipanpanda changed the title [WIP] Add JAX API Add JAX API Aug 1, 2024
@tenzen-y
Copy link
Member

tenzen-y commented Aug 2, 2024

Yes, I imagined that the controller and examples are implemented in the other PRs.

/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants