-
Notifications
You must be signed in to change notification settings - Fork 703
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add JAX API #2163
Add JAX API #2163
Conversation
/area gsoc |
Pull Request Test Coverage Report for Build 9912856908Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost lgtm
"k8s.io/apimachinery/pkg/runtime" | ||
) | ||
|
||
func addJAXJobDefaultingFuncs(scheme *runtime.Scheme) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func addJAXJobDefaultingFuncs(scheme *runtime.Scheme) error { | |
func addJAXDefaultingFuncs(scheme *runtime.Scheme) error { |
We usually do not add the Job
suffix in most frameworks, although the XGBoost has the prefix.
Let's keep it simple.
} | ||
|
||
// setJAXJobDefaultPort sets the default ports for jax container. | ||
func setJAXJobDefaultPort(spec *corev1.PodSpec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you eliminate the suffix, Job
in all functions in this defaulter file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @sandipanpanda!
// JAXJobFrameworkName is the name of the ML Framework | ||
JAXJobFrameworkName = "jax" | ||
// JAXJobReplicaTypeCoordinator is the type of Coordinator of distributed JAX | ||
JAXJobReplicaTypeCoordinator ReplicaType = "Coordinator" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed today, we don't need to have separate replica for JAX, since Pod template will be the same and Worker with index 0 will the Coordinator.
Does it sound good to you @tenzen-y ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with v1 API since we can not apply separate runPolicies only to coordinator.
In the v2 API, I believe that dedicated replica for the coordinator would be worth it since we often want to treat the coordinator runPolicies.
But, v2 API is a separate discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense @tenzen-y. Do you think, setting various runPolicy for JAX Coordinator and Workers makes sense even that JAX follows SPMD model: https://jax.readthedocs.io/en/latest/multi_process.html#launching-jax-processes ?
E.g. all JAX hosts should be always running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the v1 Training API, we don't have any ability to configure runPolicy on each replicas (coordinator and worker).
So, let's discuss if we should add the Coordinator role in the v2 API.
Support JAXJob Signed-off-by: Sandipan Panda <[email protected]>
I have updated the PR. PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM to me. Pretty straightforward. Looking forward to the controller implementation!
@sandipanpanda Is this still working in progress? |
@tenzen-y I need to add the controller part, should that be in a separate PR? |
Yes, I think, we should implement the controller logic as followup PRs. |
Yes, I imagined that the controller and examples are implemented in the other PRs. /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
Integrate JAX with Kubeflow Training Operator to support JAXJob
Ref: #1619 #2145
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes #
Checklist: