Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support processing resource types other than GPU #75

Merged
merged 7 commits into from
Jan 15, 2019
Merged

Support processing resource types other than GPU #75

merged 7 commits into from
Jan 15, 2019

Conversation

terrytangyuan
Copy link
Member

@terrytangyuan terrytangyuan commented Jan 10, 2019

This PR adds support for other processing resource types other than NVIDIA GPU such as CPU.

This is still a work-in-progress but just wanted to open this earlier here for discussion. A couple of questions to discuss:

  1. Are processingUnitsPerNode and processingResourceType good names to use?
  2. How do we want to change or deal with backwards compatibility for MPIJobSpec, main.go, and MPIJobController?
  3. Is "nvidia.com/gpu" a good default value for processingResourceType?
  4. Is there anything we want to customize for the slots in hostfile when processingResourceType is "cpu"?

This change is Reviewable

@k8s-ci-robot
Copy link

Hi @terrytangyuan. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@terrytangyuan
Copy link
Member Author

terrytangyuan commented Jan 11, 2019

Per @everpeace, some users using no GPU would like to configure custom slot= clause in hostfile. @everpeace Do you have examples on how customized they want this to be? As long as we expose the rest of the string in the slot clause, it should meet their needs, right? Let me know if there's anything I missed here.

@everpeace
Copy link

everpeace commented Jan 14, 2019

@terrytangyuan Hi, It's good to know you're working on the issue.

Overall implementation looks good to me. If users request both gpus and cpus on each worker, mpi-operator should support customization of slots clause. For example

kind: MPIJob
spec:
  replicast: 3
  template:
    spec:
      containers:
      - image: ...
         resorurces:
           limits:
             cpus: 4
             nvidia.com/gpu: 8

Current implementation put slots=8 to hostfile when processingResourceType=nvidia.com/gpu. If a user wanted to run 4 processes per worker and each process uses 2 gpus, how mpi-operator support this??

So, I think mpi-operator should support customizability of slots= clause in the API. My Initial thought was to simply introduce mpiConig.slotsPerWorker to MpiJob.spec like ChainerJob

kind: MPIJob
spec:
  replicas: 4
  mpiConfig:
    slotsPerWorker: N
   spec:
   ...

What do you think?

@terrytangyuan
Copy link
Member Author

@everpeace That makes sense. I'll expose slotsPerWorker as another parameter unless others have additional comments on this.

@rongou Any comments on the second question regarding backwards compatibility? Perhaps since we just release 0.1.0, we should introduce these new parameters as optional and keep the original GPU specific parameters?

@rongou
Copy link
Member

rongou commented Jan 14, 2019

Yeah just make it optional.

@terrytangyuan terrytangyuan changed the title [WIP] Support processing resource types other than GPU Support processing resource types other than GPU Jan 15, 2019
@terrytangyuan
Copy link
Member Author

@rongou This is ready for review. Though the assigned reviewers are both inactive and have been removed from CODEOWNERS. Maybe @rongou @everpeace you both can review this as you already joined the discussion?

@rongou rongou requested review from everpeace and rongou and removed request for ryanolson and beberg January 15, 2019 01:26
Copy link

@everpeace everpeace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@rongou
Copy link
Member

rongou commented Jan 15, 2019

Overall looks good. But I think you need to update the CRD validation for the new fields: https://github.com/kubeflow/mpi-operator/blob/master/deploy/0-crd.yaml

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 15, 2019
@terrytangyuan
Copy link
Member Author

@rongou Thanks. Added validation in the last commit.

required:
- gpus
- properties:
processingUnits:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you also add a check for processingResourceType?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's part of main.go as a flag since I thought it was more related to the cluster's spec where controller gets deployed along with processingUnitsPerNode. The check is done as part of convertProcessingResourceType(). Or maybe this is more MPIJob user specific (if they want to specify custom resource) and I should move it to MPIJobSpec?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right. Based on the experience with the gpusPerNode flag, it seems easier for users to understand if we put the options in the CRD. Maybe take them out of the operator flag and just use the job spec?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll keep those flags for now for backwards compatibility and then add those fields in MPIJobSpec that overwrites the flags when specified. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sounds good.

@rongou
Copy link
Member

rongou commented Jan 15, 2019

/lgtm
/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: everpeace, rongou, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants