[CI] Upgrade unix gpu toolchain #18186

ChaiBapchya · 2020-04-28T16:04:54Z

Description

Currently, the Unix GPU and CentOS GPU tests run on P3 and G3 AWS EC2 instances.
To improve cost and efficiency, a switch to G4 EC2 instances has been proposed.

This switch involves a broad upgrade of the GPU toolchain:

| Host Machine | Old | New |
| --- | --- | --- |
| Ubuntu LTS | 16.04.3 | 18.04.3 |
| Tesla GPU | M60 | T4 |
| EC2 Instance Type | G3 | G4 |
| Docker | 18.09 | 19.03 |
| NVIDIA Driver | 418.56 | 440.33.01 |
| CUDA Driver | 10.1 | 10.2 |

Code Changes

The latest Docker [19.03] has built-in CUDA support, so nvidia-docker is replaced with docker --gpus all.
Because the host machine now has updated drivers, TVM Op should no longer need CUDA compat [/usr/local/cuda/compat].
ubuntu_gpu_cu101 is replaced with ubuntu_build_cuda.
Docker Compose follows a multi-stage build [https://docs.docker.com/develop/develop-images/multistage-build/] and defines multiple targets:
    ubuntu_build_cuda target: gpuwithcudaruntimelibs
    ubuntu_gpu_cu101 target: gpuwithcompatenv [now commented out]
~~After testing this on CI Dev account : http://jenkins.mxnet-ci-dev.amazon-ml.com/blue/organizations/jenkins/mxnet-validation-bapac%2Funix-gpu/detail/update_gpu_toolchain/8/pipeline~~
~~The TVMOpError related to Binary Ops was encountered : TVMOp doesn't work well with GPU builds #17840~~
~~To unblock the migration from G3 to G4, these flaky tests have been skipped.~~
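The switch from nvidia-docker to docker --gpus all described above can be illustrated with a small sketch. This is not the code from the CI scripts in this PR; the function names `parse_docker_version` and `docker_run_command` are hypothetical, and the only facts relied on are that Docker 19.03 accepts `--gpus all` on `docker run` and that older setups used the `nvidia-docker` wrapper.

```python
from typing import List, Tuple


def parse_docker_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '19.03.8'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def docker_run_command(docker_version: str, image: str, command: List[str]) -> List[str]:
    """Build the argv that runs `command` inside `image` with GPU access.

    Hypothetical helper for illustration only; it does not execute anything.
    """
    if parse_docker_version(docker_version) >= (19, 3):
        # Docker 19.03 and later accept the --gpus flag directly.
        launcher = ["docker", "run", "--gpus", "all"]
    else:
        # Older hosts go through the nvidia-docker wrapper instead.
        launcher = ["nvidia-docker", "run"]
    return launcher + ["--rm", image] + command
```

For example, `docker_run_command("19.03.8", "some-image", ["nvidia-smi"])` yields a command starting with `docker run --gpus all`, while passing `"18.09"` falls back to `nvidia-docker run`. The image name here is a placeholder.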

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Code is well-documented:
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Comments

Thanks to @ptrendx for helping identify libcuda compat as the root cause of the following error:

CUDA: Check failed: e == cudaSuccess (803 vs. 0) : system has unsupported display driver / cuda driver combination

This helped me close NVIDIA/nvidia-docker#1256.
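As a rough illustration of the fix, the sketch below removes the compat directory from a library search path so that the host's own libcuda is found first. It assumes the compat libraries were being picked up through a colon-separated path such as `LD_LIBRARY_PATH`, which this PR does not state explicitly; `drop_cuda_compat` is a hypothetical helper, not part of MXNet's CI.

```python
import os
from typing import Optional

# Directory named in the PR description.
COMPAT_DIR = "/usr/local/cuda/compat"


def drop_cuda_compat(ld_library_path: Optional[str]) -> str:
    """Return the search path with the CUDA compat directory removed.

    Assumption: the path is colon-separated, as LD_LIBRARY_PATH is on Linux.
    """
    if not ld_library_path:
        return ""
    # Skip empty entries produced by leading, trailing, or doubled colons.
    entries = [entry for entry in ld_library_path.split(":") if entry]
    # normpath strips a trailing slash so '/usr/local/cuda/compat/' also matches.
    kept = [entry for entry in entries if os.path.normpath(entry) != COMPAT_DIR]
    return ":".join(kept)
```

With this helper, `drop_cuda_compat("/usr/local/cuda/compat:/usr/local/cuda/lib64")` leaves only `/usr/local/cuda/lib64` in the path.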

Thanks to @leezu and @josephevans for their help throughout this migration effort, and to @sandeep-krishnamurthy and @szha for the guidance.

mxnet-bot · 2020-04-28T16:04:59Z

Hey @ChaiBapchya , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-gpu, clang, website, sanity, edge, centos-gpu, windows-cpu, unix-cpu, windows-gpu, centos-cpu, miscellaneous]

Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

ChaiBapchya · 2020-04-28T22:27:17Z

Rebased to fix windows-gpu issue : fixed in #18177

ChaiBapchya · 2020-05-10T07:40:27Z

@mxnet-bot run ci [windows-gpu]

assertion failed for test_np_mixed_precision_binary_funcs : Likely flaky

mxnet-bot · 2020-05-10T07:40:34Z

Jenkins CI successfully triggered : [windows-gpu]

ChaiBapchya · 2020-05-11T16:59:26Z

Infra related changes : apache/mxnet-ci#20
Specifically

Updated the auto-scaling lambda with an additional instance node label: mxnetlinux-gpu-g4
Updated the env variables for instances with the G4 node label with label-specific information:

    LAUNCH_TEMPLATES : "mxnetlinux-gpu-g4":{"id":"lt-0ebf575cc5a56ebf4","version":"1"}
    EXECUTORS_PER_LABEL : "mxnetlinux-gpu-g4":1
    WARM_POOL_SIZE : "mxnetlinux-gpu-g4":0
    MINIMUM_QUEUE_TIMES_SEC : "mxnetlinux-gpu-g4":30
    CCACHE_EFS_DNS : "mxnetlinux-gpu-g4":"NONE"
    MAXIMUM_STARTUP_TIME_SEC : "mxnetlinux-gpu-g4":300
    MANAGED_JENKINS_NODE_LABELS : "mxnetlinux-gpu-g4"
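To show how such per-label settings fit together, here is a minimal sketch that collects every setting for one node label. It assumes each variable holds a JSON object keyed by node label, which matches the fragments listed above but is not confirmed by this PR; `SAMPLE_ENV` and `settings_for_label` are hypothetical names, and the real auto-scaling lambda may read its configuration differently.

```python
import json
from typing import Any, Dict

LABEL = "mxnetlinux-gpu-g4"

# Hypothetical environment: each value is assumed to be a JSON object
# keyed by node label. The values are copied from the list above.
SAMPLE_ENV: Dict[str, str] = {
    "LAUNCH_TEMPLATES": '{"mxnetlinux-gpu-g4": {"id": "lt-0ebf575cc5a56ebf4", "version": "1"}}',
    "EXECUTORS_PER_LABEL": '{"mxnetlinux-gpu-g4": 1}',
    "WARM_POOL_SIZE": '{"mxnetlinux-gpu-g4": 0}',
    "MINIMUM_QUEUE_TIMES_SEC": '{"mxnetlinux-gpu-g4": 30}',
    "CCACHE_EFS_DNS": '{"mxnetlinux-gpu-g4": "NONE"}',
    "MAXIMUM_STARTUP_TIME_SEC": '{"mxnetlinux-gpu-g4": 300}',
}


def settings_for_label(env: Dict[str, str], label: str) -> Dict[str, Any]:
    """Collect the value of every per-label variable for a single node label."""
    settings: Dict[str, Any] = {}
    for name, raw in env.items():
        per_label = json.loads(raw)
        if label not in per_label:
            # A missing entry would leave the new label half-configured.
            raise KeyError(f"{name} has no entry for {label}")
        settings[name] = per_label[label]
    return settings
```

Calling `settings_for_label(SAMPLE_ENV, LABEL)` returns one dictionary with all six settings for the G4 label, and raises `KeyError` if any variable lacks an entry for it.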

Specific commit : apache/mxnet-ci@1a537af

Manually created a launch template for the G4 node that points to the AMI [created in the dev account and made accessible to the prod account], following the steps described here: https://cwiki.apache.org/confluence/display/MXNET/Setup#Setup-Slave

leezu

Thank you!

ChaiBapchya · 2020-05-11T18:27:01Z

test_np_mixed_precision_binary_funcs, which previously failed on windows-gpu (Python 3: GPU Win) on multiple occasions:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-18186/8/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-18186/7/pipeline

It now fails on unix-gpu (Python3: MKLDNN-GPU-NOCUDNN):
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18186/8/pipeline/387

Related issue : #16848

* update nvidiadocker command & remove cuda compat
* replace cu101 with cuda since compat is no longer to be used
* skip flaky tests
* get rid of ubuntu_build_cuda and point ubuntu_cu101 to base gpu instead of cuda compat
* Revert "skip flaky tests"
  This reverts commit 1c720fa.
* revert removal of ubuntu_build_cuda
* add linux gpu g4 node to all steps using g3 in unix-gpu pipeline

* Update unix gpu toolchain (#18186)
* update nvidiadocker command & remove cuda compat
* replace cu101 with cuda since compat is no longer to be used
* skip flaky tests
* get rid of ubuntu_build_cuda and point ubuntu_cu101 to base gpu instead of cuda compat
* Revert "skip flaky tests"
  This reverts commit 1c720fa.
* revert removal of ubuntu_build_cuda
* add linux gpu g4 node to all steps using g3 in unix-gpu pipeline
* remove docker compose files
* add back the caffe test since caffe is deprecated for mx2.0 and not 1.x
* drop nvidia-docker requirement since docker19.0 supports it by default :q
* remove compat from dockerfile
* Cherry-pick #18635 to v1.7.x (#18935)
* Remove mention of nightly in pypi (#18635)
* update bert dev.tsv link
  Co-authored-by: Sheng Zha <[email protected]>
* disable tvm in CI functions that rely on libcuda compat
* tvm off for ubuntu_gpu_cmake build
* drop tvm from all unix-gpu builds
  Co-authored-by: Carin Meier <[email protected]>
  Co-authored-by: Sheng Zha <[email protected]>

ChaiBapchya requested review from aaronmarkham and marcoabreu as code owners April 28, 2020 16:04

leezu reviewed Apr 28, 2020

ci/jenkins/Jenkins_steps.groovy (outdated)

ChaiBapchya added 3 commits April 28, 2020 15:26

update nvidiadocker command & remove cuda compat

fe1fef4

replace cu101 with cuda since compat is no longer to be used

f761b24

skip flaky tests

1c720fa

ChaiBapchya force-pushed the update_unix_gpu_toolchain branch from f8331ec to 1c720fa Compare April 28, 2020 22:26

ChaiBapchya marked this pull request as draft April 28, 2020 23:48

get rid of ubuntu_build_cuda and point ubuntu_cu101 to base gpu instead of cuda compat

ec5330d

ChaiBapchya force-pushed the update_unix_gpu_toolchain branch from 022e135 to ec5330d Compare April 28, 2020 23:51

leezu reviewed May 1, 2020

tests/python/unittest/test_numpy_op.py (outdated)

ChaiBapchya added 4 commits May 7, 2020 09:55

Revert "skip flaky tests"

36f5563

This reverts commit 1c720fa.

Merge branch 'master' into update_unix_gpu_toolchain

b2428b9

revert removal of ubuntu_build_cuda

386c42b

add linux gpu g4 node to all steps using g3 in unix-gpu pipeline

2b05566

ChaiBapchya marked this pull request as ready for review May 10, 2020 07:02

ChaiBapchya requested a review from leezu May 11, 2020 17:43

leezu approved these changes May 11, 2020

josephevans approved these changes May 11, 2020

leezu merged commit 21899f8 into apache:master May 12, 2020

This was referenced Jun 14, 2020

Change instance type in CI to use G4 instance #17804

Closed

Test update toolchain for unix gpu ChaiBapchya/mxnet#53

Closed

ChaiBapchya changed the title ~~Update unix gpu toolchain~~ [CI] Upgrade unix gpu toolchain Jul 24, 2020

ChaiBapchya mentioned this pull request Jul 24, 2020

[CI][1.x] Cherrypick: Upgrade unix gpu toolchain (#18186) #18785

Merged

ChaiBapchya deleted the update_unix_gpu_toolchain branch August 11, 2020 08:41
