Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add JAX controller #2194

Merged
merged 2 commits into from
Sep 20, 2024
Merged

Conversation

sandipanpanda
Copy link
Member

@sandipanpanda sandipanpanda commented Aug 5, 2024

What this PR does / why we need it:
Implement JAX controller

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fix: #1619
Ref:#2145

/area gsoc

@sandipanpanda
Copy link
Member Author

@coveralls
Copy link

coveralls commented Aug 5, 2024

Pull Request Test Coverage Report for Build 10960403998

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 10937611143: 0.0%
Covered Lines: 66
Relevant Lines: 66

💛 - Coveralls

@shravan-achar
Copy link

shravan-achar commented Aug 14, 2024

@sandipanpanda - Is this PR ready for review?

@tenzen-y
Copy link
Member

@sandipanpanda - Is this PR ready for review?

Actually, no. @sandipanpanda continues trying some implementations.
Regarding the remaining implementations, you can find the weekly sync documentation.

@sandipanpanda
Copy link
Member Author

Adding jaxjob webhook test, examples and finishing up some work on test_e2e_jaxjob.py remain. Can you please share your input if the current implementation up until now is in the correct direction?

@andreyvelich
Copy link
Member

@sandipanpanda If this PR is ready for review, please review the WIP from the PR title.

@sandipanpanda sandipanpanda force-pushed the jax-controller branch 3 times, most recently from a5e554b to 4c266aa Compare August 28, 2024 20:03
@sandipanpanda sandipanpanda changed the title [WIP] Add JAX controller Add JAX controller Aug 28, 2024
examples/jax/cpu-demo/train.py Show resolved Hide resolved
examples/jax/cpu-demo/train.py Outdated Show resolved Hide resolved
examples/jax/cpu-demo/demo.yaml Outdated Show resolved Hide resolved
examples/jax/cpu-demo/demo.yaml Outdated Show resolved Hide resolved
examples/jax/cpu-demo/demo.yaml Outdated Show resolved Hide resolved
pkg/controller.v1/jax/jaxjob_controller.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/jaxjob_controller.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/jaxjob_controller.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/jaxjob_controller.go Show resolved Hide resolved
pkg/controller.v1/jax/jaxjob_controller.go Outdated Show resolved Hide resolved
@tenzen-y
Copy link
Member

@sandipanpanda Could you address this error? We need to pass the appropriate arguments to e2e testing.

DEBUG kubernetes.client.rest:rest.py:235 response body: FATAL Flags parsing error:
flag --job_name=None: Flag --job_name must have a value other than None.
flag --sub_domain=None: Flag --sub_domain must have a value other than None.
flag --coordinator_port=None: Flag --coordinator_port must have a value other than None.
Pass --helpshort or --helpfull to see help on flags.

DEBUG kubernetes.client.rest:rest.py:235 response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Success","details":{"name":"jaxjob-cpu-ci-test","group":"kubeflow.org","kind":"jaxjobs","uid":"f7402a7c-588a-4b9d-903d-969ba0d4c7e2"}}

https://github.com/kubeflow/training-operator/actions/runs/10753814852/job/29823692795?pr=2194#step:9:5832

@sandipanpanda sandipanpanda force-pushed the jax-controller branch 2 times, most recently from a3303b7 to f957d69 Compare September 10, 2024 20:17
@sandipanpanda
Copy link
Member Author

cc @tenzen-y PTAL

examples/jax/cpu-demo/demo.yaml Outdated Show resolved Hide resolved
examples/jax/cpu-demo/Dockerfile Show resolved Hide resolved
examples/jax/cpu-demo/Dockerfile Show resolved Hide resolved
pkg/controller.v1/jax/envvar.go Show resolved Hide resolved
pkg/controller.v1/jax/jaxjob_controller_test.go Outdated Show resolved Hide resolved
sdk/python/kubeflow/training/constants/constants.py Outdated Show resolved Hide resolved
sdk/python/test/e2e/test_e2e_jaxjob.py Show resolved Hide resolved
@sandipanpanda sandipanpanda force-pushed the jax-controller branch 3 times, most recently from 9effb4a to 7c0563d Compare September 18, 2024 15:27
Add JAX controller, controller tests, webhook validations, examples,
e2e tests for JAXJob

Extend the Training Operator Python SDK to simplify the creation and
management of JAXJob resources.

Signed-off-by: Sandipan Panda <[email protected]>
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/envvar.go Show resolved Hide resolved
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
Signed-off-by: Sandipan Panda <[email protected]>
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for a lot of work!
/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 8285aff into kubeflow:master Sep 20, 2024
40 checks passed
@sandipanpanda
Copy link
Member Author

Thank you for your unwavering guidance and support throughout this project!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(framework): Support JAX
6 participants