
Update links for Kubeflow Training Operator #3002

Merged

Conversation

@andreyvelich (Member) commented Oct 5, 2021

After this PR: kubeflow/training-operator#1348, we renamed the repo to training-operator.
I tried to update all the corresponding links and some legacy docs for Training.

Please take a look.

I didn't update the Reference docs; do we want to keep them?
We have some scripts to generate them: https://github.com/kubeflow/website/tree/master/gen-api-reference.

/assign @kubeflow/wg-training-leads @shannonbradshaw

@shannonbradshaw (Contributor) commented

Thanks for submitting this PR, @andreyvelich! I will review it first thing tomorrow morning.

@terrytangyuan (Member) left a comment

/lgtm

@shannonbradshaw (Contributor) left a comment

I suggested corrections in a few files; otherwise:

/lgtm

> With using volcano scheduler to apply gang-scheduling, a job can run only if there are enough resources for all the pods of the job. Otherwise, all the pods will be in pending state waiting for enough resources. For example, if a job requiring N pods is created and there are only enough resources to schedule N-2 pods, then N pods of the job will stay pending.
>
> **Note:** when in a high workload, if a pod of the job dies when the job is still running, it might give other pods chance to occupied the resources and cause deadlock.

Contributor left a comment

There are a couple of typos here. Fixes in bold below:
...when in a high workload, if a pod of the job dies when the job is still running, it might give other pods **a chance to occupy** the resources and cause deadlock.
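To make the gang-scheduling behaviour in the quoted passage concrete, here is a minimal, hypothetical sketch of a TFJob whose pods are handed to the volcano scheduler. It assumes the training operator is running with gang scheduling enabled; the image name and resource requests are placeholders, not values from this PR.

```yaml
# Hypothetical example only: a TFJob that asks volcano to gang-schedule its pods.
# Assumes the training operator was started with gang scheduling enabled; exact
# flags and field names can differ between operator versions.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: gang-scheduled-example
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 4                    # volcano binds the pods only when all 4 can be placed
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: volcano     # hand the pods to the volcano scheduler
          containers:
            - name: tensorflow       # default container name expected by TFJob
              image: registry.example.com/train:latest   # placeholder image
              resources:
                requests:
                  cpu: "4"
                  memory: 8Gi
```

If the cluster can only place, say, two of the four workers, none of them are bound, which is the pending behaviour described in the quoted doc text.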


> Before you use the auto-tuning example, there is some preparatory work need to be finished in advance.
> To let TVM tune your network, you should create a docker image which has TVM module.
> Then, you need a auto-tuning script to specify which network will be tuned and set the auto-tuning parameters.
> For more details, please see [tutorials](https://docs.tvm.ai/tutorials/autotvm/tune_relay_mobile_gpu.html#sphx-glr-tutorials-autotvm-tune-relay-mobile-gpu-py).
> Finally, you need a startup script to start the auto-tuning program. In fact, MXJob will set all the parameters as environment variables and the startup script need to reed these variable and then transmit them to auto-tuning script.

Contributor left a comment

Fixes in bold below.
Finally, you need a startup script to start the auto-tuning program. In fact, MXJob will set all the parameters as environment variables and the startup script **needs to read** these **variables** and then transmit them to **the** auto-tuning script.
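As a rough illustration of that last sentence, a tuner container could use a small startup command that reads the injected environment variables and forwards them to the auto-tuning script. The variable names (NETWORK, TUNING_TRIALS), script path, and image below are made up for this sketch and are not fields defined by the operator.

```yaml
# Hypothetical pod-template fragment; env var names, paths, and image are placeholders.
containers:
  - name: mxnet-autotune
    image: registry.example.com/mxnet-tvm-autotune:latest   # image built with the TVM module
    command: ["sh", "-c"]
    args:
      - |
        # startup script: read the variables set by the operator and pass them on
        python /opt/autotune/tune_network.py \
          --network "${NETWORK}" \
          --num-trials "${TUNING_TRIALS}"
```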

> You can create a training job by defining a `XGboostJob` config file. See the manifests for the [IRIS example](https://github.com/kubeflow/training-operator/blob/master/examples/xgboost/xgboostjob.yaml). You may change the config file based on your requirements. eg: add `CleanPodPolicy` in Spec to `None` to retain pods after job termination.

Contributor left a comment

eg: -> E.g.,
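For readers who want to see what such a config looks like, here is a minimal, hypothetical XGBoostJob sketch with the pod-retention setting mentioned above. It loosely follows the training-operator v1 API, but field names (for example, whether `cleanPodPolicy` sits under `runPolicy` or directly under `spec`) and the image vary by operator version, so treat it as an illustration rather than the linked manifest.

```yaml
# Hypothetical sketch, not the manifest linked above; field names may differ by version.
apiVersion: kubeflow.org/v1
kind: XGBoostJob
metadata:
  name: xgboost-iris-example
spec:
  runPolicy:
    cleanPodPolicy: None            # keep the pods around after the job terminates
  xgbReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: xgboost         # default container name expected by XGBoostJob
              image: registry.example.com/xgboost-iris:latest   # placeholder image
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: xgboost
              image: registry.example.com/xgboost-iris:latest   # placeholder image
```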

@andreyvelich (Member, Author) commented

Thank you for the review @shannonbradshaw!
I made these changes.

@thesuperzapper (Member) commented

FYI, the PR #3014 restructures the Components / Training Operators section to allow for moving References from the top level.

I don't mind who merges first, but someone will have to rebase.

@shannonbradshaw (Contributor) commented

/lgtm

@kimwnasptd (Member) commented

/lgtm

@andreyvelich (Member, Author) commented Oct 7, 2021

> FYI, the PR #3014 restructures the Components / Training Operators section to allow for moving References from the top level.
>
> I don't mind who merges first, but someone will have to rebase.

I think we can remove the reference docs for Training Operators in the future, since we keep them here: https://github.com/kubeflow/training-operator/tree/master/docs/api.

WDYT @kubeflow/wg-training-leads?

@terrytangyuan (Member) commented

> I think we can remove the reference docs for Training Operators in the future, since we keep them here: https://github.com/kubeflow/training-operator/tree/master/docs/api.
>
> WDYT @kubeflow/wg-training-leads?

Yes. They easily get outdated on the website. Also, there's an issue to autogenerate the docs: #1924.

@andreyvelich force-pushed the fix-training-operator-links branch from 3d46599 to 4162db0 on October 8, 2021 at 11:40
@andreyvelich (Member, Author) commented

> > I think we can remove the reference docs for Training Operators in the future, since we keep them here: https://github.com/kubeflow/training-operator/tree/master/docs/api.
> > WDYT @kubeflow/wg-training-leads?
>
> Yes. They easily get outdated on the website. Also, there's an issue to autogenerate the docs: #1924.

Sounds good.
@kubeflow/wg-training-leads If you are fine with these changes, I think we can merge this PR.

@andreyvelich (Member, Author) commented

@Bobgy @james-jwu @zijianjoy Can you please help with the approval of this PR?

@james-jwu commented

/lgtm
/approve

@google-oss-robot commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: andreyvelich, james-jwu, shannonbradshaw, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

@google-oss-robot merged commit 63399c1 into kubeflow:master on Oct 9, 2021
@andreyvelich deleted the fix-training-operator-links branch on October 9, 2021 at 01:19