Update links for Kubeflow Training Operator #3002
Conversation
Thanks for submitting this PR, @andreyvelich! I will review first thing in my morning tomorrow.
/lgtm
I suggested corrections in a few files; otherwise:
/lgtm
With using volcano scheduler to apply gang-scheduling, a job can run only if there are enough resources for all the pods of the job. Otherwise, all the pods will be in pending state waiting for enough resources. For example, if a job requiring N pods is created and there are only enough resources to schedule N-2 pods, then N pods of the job will stay pending.

**Note:** when in a high workload, if a pod of the job dies when the job is still running, it might give other pods chance to occupied the resources and cause deadlock.
There are a couple of typos here. Fixes in bold below
...when in a high workload, if a pod of the job dies when the job is still running, it might give other pods a chance to occupy the resources and cause deadlock.
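For readers following along, here is a minimal sketch of what a gang-scheduled training job could look like. It assumes Volcano is installed and the training operator has gang scheduling enabled; the job kind, names, images, and resource values are illustrative placeholders, not something taken from this PR.

```yaml
# Illustrative sketch only: a PyTorchJob whose worker pods ask for the Volcano
# scheduler. With gang scheduling the workers are placed all-or-nothing, so the
# job stays pending until every replica can be scheduled at once.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-gang-example            # hypothetical name
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 4
      template:
        spec:
          schedulerName: volcano        # hand these pods to Volcano for gang scheduling
          containers:
            - name: pytorch
              image: registry.example.com/pytorch-dist-train:latest   # placeholder image
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
```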
Before you use the auto-tuning example, there is some preparatory work need to be finished in advance.
To let TVM tune your network, you should create a docker image which has TVM module.
Then, you need a auto-tuning script to specify which network will be tuned and set the auto-tuning parameters.
For more details, please see [tutorials](https://docs.tvm.ai/tutorials/autotvm/tune_relay_mobile_gpu.html#sphx-glr-tutorials-autotvm-tune-relay-mobile-gpu-py).
Finally, you need a startup script to start the auto-tuning program. In fact, MXJob will set all the parameters as environment variables and the startup script need to reed these variable and then transmit them to auto-tuning script.
Fixes in bold below.
Finally, you need a startup script to start the auto-tuning program. In fact, MXJob will set all the parameters as environment variables and the startup script needs to read these variable and then transmit them to the auto-tuning script.
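To make that workflow more concrete, here is a hedged sketch of how such a tuning job might be declared. The `MXTune` job mode and the `Tuner` replica type are recalled from the mxnet-operator examples and should be treated as assumptions, and the image, script path, and names are placeholders.

```yaml
# Illustrative sketch only: an MXJob that runs the auto-tuning program.
# The operator is expected to expose the tuning parameters to the container as
# environment variables; the startup script reads them and forwards them to the
# auto-tuning script baked into the image.
apiVersion: kubeflow.org/v1
kind: MXJob
metadata:
  name: mxnet-autotune-example                  # hypothetical name
spec:
  jobMode: MXTune                               # assumed value, per the mxnet-operator tuning example
  mxReplicaSpecs:
    Tuner:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: registry.example.com/mxnet-tvm-autotune:latest   # image built with the TVM module
              command: ["/bin/sh", "-c", "/opt/start.sh"]             # startup script: read env vars, call the tuning script
```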
You can create a training job by defining a `XGboostJob` config file. See the manifests for the [IRIS example](https://github.com/kubeflow/training-operator/blob/master/examples/xgboost/xgboostjob.yaml). You may change the config file based on your requirements. eg: add `CleanPodPolicy` in Spec to `None` to retain pods after job termination.
eg: -> E.g.,
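As a concrete illustration of that passage, here is a minimal sketch of an XGBoostJob manifest with the pod-retention setting. The exact placement of `cleanPodPolicy` (top level of the spec vs. under `runPolicy`) can vary by operator version and is an assumption here, as are the names and image.

```yaml
# Illustrative sketch only: an XGBoostJob that keeps its pods after the job
# terminates, which is handy for inspecting logs afterwards.
apiVersion: kubeflow.org/v1
kind: XGBoostJob
metadata:
  name: xgboost-iris-example                    # hypothetical name
spec:
  runPolicy:
    cleanPodPolicy: None                        # retain pods after job termination (placement assumed)
  xgbReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: xgboost
              image: registry.example.com/xgboost-iris:latest   # placeholder image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: xgboost
              image: registry.example.com/xgboost-iris:latest   # placeholder image
```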
Thank you for the review @shannonbradshaw!
FYI, the PR #3014 restructures the I don't mind who merges first, but someone will have to rebase.
/lgtm
/lgtm
I think we can remove the reference docs for the Training Operators in the future, since we keep them here: https://github.com/kubeflow/training-operator/tree/master/docs/api. WDYT @kubeflow/wg-training-leads?
Yes. They easily get outdated on the website. Also, there's an issue to autogenerate the docs: #1924.
Sounds good.
@Bobgy @james-jwu @zijianjoy Can you please help with this PR approval?
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, james-jwu, shannonbradshaw, terrytangyuan. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
After the PR kubeflow/training-operator#1348, we renamed the repo to `training-operator`. I tried to update all the corresponding links and update some legacy docs for Training.
Please take a look.
I didn't update the Reference docs; do we want to keep them? We have some scripts to generate them: https://github.com/kubeflow/website/tree/master/gen-api-reference.
/assign @kubeflow/wg-training-leads @shannonbradshaw