This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[CI] Upgrade unix gpu toolchain #18186

Merged: 8 commits, May 12, 2020

Conversation

ChaiBapchya
Contributor

@ChaiBapchya ChaiBapchya commented Apr 28, 2020

Description

Unix GPU and CentOS GPU tests currently run on P3 and G3 AWS EC2 instances.
To reduce cost and improve efficiency, a switch to G4 EC2 instances has been proposed.

This switch involves upgrading the GPU toolchain across the board:

| Host Machine | Old | New |
|---|---|---|
| Ubuntu LTS | 16.04.3 | 18.04.3 |
| Tesla GPU | M60 | T4 |
| EC2 Instance Type | G3 | G4 |
| Docker | 18.09 | 19.03 |
| NVIDIA Driver | 418.56 | 440.33.01 |
| CUDA Driver | 10.1 | 10.2 |

Code Changes

  1. Docker 19.03 has built-in GPU support, so nvidia-docker is replaced with docker --gpus all.

  2. Since the host machine now has updated drivers, TVM Op should no longer need CUDA compat [/usr/local/cuda/compat].

  3. Replace ubuntu_gpu_cu101 with ubuntu_build_cuda.
    Docker Compose follows a multi-stage build [https://docs.docker.com/develop/develop-images/multistage-build/] and defines multiple targets:
    the ubuntu_build_cuda target is gpuwithcudaruntimelibs
    the ubuntu_gpu_cu101 target is gpuwithcompatenv [now commented out]

  4. Testing this on the CI dev account [http://jenkins.mxnet-ci-dev.amazon-ml.com/blue/organizations/jenkins/mxnet-validation-bapac%2Funix-gpu/detail/update_gpu_toolchain/8/pipeline]
    hit the TVMOpError related to binary ops: TVMOp doesn't work well with GPU builds #17840
    To unblock the migration from G3 to G4, these flaky tests have been skipped.
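
To make the first change concrete, here is a minimal sketch of how a build script could choose between the two invocations. The helper name `docker_run_command` and its parameters are hypothetical and are not taken from the MXNet CI scripts; only `nvidia-docker run` and `docker run --gpus all` are the commands discussed in this PR.

```python
from typing import List, Sequence, Tuple


def docker_run_command(image: str,
                       command: Sequence[str],
                       use_gpu: bool,
                       docker_version: Tuple[int, int] = (19, 3)) -> List[str]:
    """Build a `docker run` argv list (hypothetical helper, not the CI code)."""
    if use_gpu and docker_version < (19, 3):
        # Older Docker: GPU access goes through the nvidia-docker wrapper.
        return ["nvidia-docker", "run", "--rm", image, *command]
    argv = ["docker", "run", "--rm"]
    if use_gpu:
        # Docker 19.03 and later: request GPUs with the built-in --gpus flag.
        argv += ["--gpus", "all"]
    return argv + [image, *command]


if __name__ == "__main__":
    # Image name is a placeholder for illustration only.
    print(" ".join(docker_run_command("example/image", ["nvidia-smi"], use_gpu=True)))
```

Returning an argv list rather than a shell string keeps the sketch safe to pass to `subprocess.run` without quoting issues.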

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Code is well-documented:
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Comments

Thanks to @ptrendx for helping identify libcuda compat as the root cause of

CUDA: Check failed: e == cudaSuccess (803 vs. 0) : system has unsupported display driver / cuda driver combination

This helped me close NVIDIA/nvidia-docker#1256
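
As an illustration of keeping the compat libraries out of the loader's way, the sketch below filters the compat directory out of a library search path. It assumes the directory reaches the dynamic loader through an `LD_LIBRARY_PATH`-style variable, which this PR does not confirm for the MXNet CI images; the function name `without_cuda_compat` is hypothetical.

```python
import posixpath

# Path quoted in the PR description.
CUDA_COMPAT_DIR = "/usr/local/cuda/compat"


def without_cuda_compat(ld_library_path: str) -> str:
    """Return a colon-separated search path with CUDA compat entries removed."""
    kept = [
        entry for entry in ld_library_path.split(":")
        # Drop empty entries and any spelling of the compat directory
        # (normpath removes a trailing slash, for example).
        if entry and posixpath.normpath(entry) != CUDA_COMPAT_DIR
    ]
    return ":".join(kept)


if __name__ == "__main__":
    print(without_cuda_compat("/usr/local/cuda/compat:/usr/local/cuda/lib64"))
```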

Thanks to @leezu and @josephevans for their help throughout this migration effort, and to @sandeep-krishnamurthy and @szha for the guidance.

@mxnet-bot

Hey @ChaiBapchya , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-gpu, clang, website, sanity, edge, centos-gpu, windows-cpu, unix-cpu, windows-gpu, centos-cpu, miscellaneous]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@ChaiBapchya
Contributor Author

Rebased to fix the windows-gpu issue, which was fixed in #18177.

@ChaiBapchya ChaiBapchya marked this pull request as draft April 28, 2020 23:48
@ChaiBapchya ChaiBapchya marked this pull request as ready for review May 10, 2020 07:02
@ChaiBapchya
Contributor Author

@mxnet-bot run ci [windows-gpu]

Assertion failed for test_np_mixed_precision_binary_funcs; likely flaky.

@mxnet-bot

Jenkins CI successfully triggered : [windows-gpu]

@ChaiBapchya
Contributor Author

Infra related changes : apache/mxnet-ci#20
Specifically

  • Updated the auto-scaling lambda with an additional instance node label: mxnetlinux-gpu-g4
  • Updated the env variables for instances with the G4 node label:
    LAUNCH_TEMPLATES : "mxnetlinux-gpu-g4":{"id":"lt-0ebf575cc5a56ebf4","version":"1"}
    EXECUTORS_PER_LABEL : "mxnetlinux-gpu-g4":1
    WARM_POOL_SIZE : "mxnetlinux-gpu-g4":0
    MINIMUM_QUEUE_TIMES_SEC : "mxnetlinux-gpu-g4":30
    CCACHE_EFS_DNS : "mxnetlinux-gpu-g4":"NONE"
    MAXIMUM_STARTUP_TIME_SEC : "mxnetlinux-gpu-g4":300
    MANAGED_JENKINS_NODE_LABELS : "mxnetlinux-gpu-g4"
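
For readers unfamiliar with this kind of configuration, the following sketch shows one way per-label settings like these could be read. It assumes each variable holds a JSON object keyed by node label, which is inferred from the entry syntax in this comment and not confirmed against the lambda's code; `settings_for_label` is a hypothetical helper.

```python
import json

# Entries copied from the comment, wrapped in braces to form JSON objects.
# Assumption: each variable is a JSON object keyed by Jenkins node label.
ENV = {
    "LAUNCH_TEMPLATES": '{"mxnetlinux-gpu-g4": {"id": "lt-0ebf575cc5a56ebf4", "version": "1"}}',
    "EXECUTORS_PER_LABEL": '{"mxnetlinux-gpu-g4": 1}',
    "WARM_POOL_SIZE": '{"mxnetlinux-gpu-g4": 0}',
    "MINIMUM_QUEUE_TIMES_SEC": '{"mxnetlinux-gpu-g4": 30}',
    "MAXIMUM_STARTUP_TIME_SEC": '{"mxnetlinux-gpu-g4": 300}',
}


def settings_for_label(env, label):
    """Collect the value for one node label from every JSON-encoded variable."""
    settings = {}
    for name, raw in env.items():
        per_label = json.loads(raw)
        if label in per_label:
            settings[name] = per_label[label]
    return settings


g4 = settings_for_label(ENV, "mxnetlinux-gpu-g4")

if __name__ == "__main__":
    for name, value in g4.items():
        print(name, "=", value)
```

In a real deployment the `ENV` mapping would presumably come from `os.environ`; a plain dict is used here so the sketch runs anywhere.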

Specific commit : apache/mxnet-ci@1a537af

Manually created a launch template for the G4 node pointing to the AMI [created in the dev account and accessible to the prod account], following the steps mentioned here: https://cwiki.apache.org/confluence/display/MXNET/Setup#Setup-Slave

@ChaiBapchya ChaiBapchya requested a review from leezu May 11, 2020 17:43
Contributor

@leezu leezu left a comment


Thank you!

@ChaiBapchya
Contributor Author

ChaiBapchya commented May 11, 2020

@leezu leezu merged commit 21899f8 into apache:master May 12, 2020
AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this pull request Jul 6, 2020
* update nvidiadocker command & remove cuda compat

* replace cu101 with cuda since compat is no longer to be used

* skip flaky tests

* get rid of ubuntu_build_cuda and point ubuntu_cu101 to base gpu instead of cuda compat

* Revert "skip flaky tests"

This reverts commit 1c720fa.

* revert removal of ubuntu_build_cuda

* add linux gpu g4 node to all steps using g3 in unix-gpu pipeline
ChaiBapchya added a commit to ChaiBapchya/mxnet that referenced this pull request Jul 24, 2020
@ChaiBapchya ChaiBapchya changed the title Update unix gpu toolchain [CI] Upgrade unix gpu toolchain Jul 24, 2020
@ChaiBapchya ChaiBapchya deleted the update_unix_gpu_toolchain branch August 11, 2020 08:41
ChaiBapchya added a commit to ChaiBapchya/mxnet that referenced this pull request Aug 15, 2020
szha added a commit that referenced this pull request Aug 18, 2020
* Update unix gpu toolchain (#18186)

* update nvidiadocker command & remove cuda compat

* replace cu101 with cuda since compat is no longer to be used

* skip flaky tests

* get rid of ubuntu_build_cuda and point ubuntu_cu101 to base gpu instead of cuda compat

* Revert "skip flaky tests"

This reverts commit 1c720fa.

* revert removal of ubuntu_build_cuda

* add linux gpu g4 node to all steps using g3 in unix-gpu pipeline

* remove docker compose files

* add back the caffe test since caffe is deprecated for mx2.0 and not 1.x

* drop nvidia-docker requirement since docker19.0 supports it by default

:q

* remove compat from dockerfile

* Cherry-pick #18635 to v1.7.x (#18935)

* Remove mention of nightly in pypi (#18635)

* update bert dev.tsv link

Co-authored-by: Sheng Zha

* disable tvm in CI functions that rely on libcuda compat

* tvm off for ubuntu_gpu_cmake build

* drop tvm from all unix-gpu builds

Co-authored-by: Carin Meier
Co-authored-by: Sheng Zha
Development

Successfully merging this pull request may close these issues.

system has unsupported display driver / cuda driver combination
4 participants