This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Experiment with CI cudnn versions [Do not merge] #15847

Closed
wants to merge 18 commits into from

DickJC123
Contributor

Description

While trying to track down the RNN test flakiness that surfaced in PR #15741, I noticed that the CI unix-gpu runners run against cuDNN 7.5, while libmxnet.so was built against cuDNN 7.6. This PR is an attempt to see whether the flakiness is correlated with the cuDNN library version.
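
For illustration, a build-versus-runtime mismatch like the one described above could be detected with a small check. This is only a sketch, not part of this PR or of MXNet's CI. It assumes the cuDNN 7.x encoding of the version integer (major * 1000 + minor * 100 + patch), and it assumes the build-time value is already known, for example from the build log. The helper names, the default library name `libcudnn.so`, and the sample values 7600 and 7500 are hypothetical. The only cuDNN call used is `cudnnGetVersion()`.

```python
import ctypes
from typing import Optional, Tuple


def decode_cudnn_version(version: int) -> Tuple[int, int, int]:
    """Split a cuDNN 7.x-style version integer into (major, minor, patch).

    cuDNN 7.x encodes its version as major * 1000 + minor * 100 + patch,
    so 7600 means 7.6.0.
    """
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def minor_versions_match(build_version: int, runtime_version: int) -> bool:
    """Return True when both versions agree on major and minor."""
    build = decode_cudnn_version(build_version)
    runtime = decode_cudnn_version(runtime_version)
    return build[:2] == runtime[:2]


def runtime_cudnn_version(lib_name: str = "libcudnn.so") -> Optional[int]:
    """Ask the cuDNN shared library for its version.

    Returns None when the library cannot be loaded (for example on a
    machine without cuDNN). The library name is an assumption and may
    need a version suffix on a given system.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return int(lib.cudnnGetVersion())


if __name__ == "__main__":
    BUILD_VERSION = 7600  # hypothetical: cuDNN 7.6 at build time
    found = runtime_cudnn_version()
    if found is None:
        print("cuDNN library not found; skipping check")
    elif not minor_versions_match(BUILD_VERSION, found):
        print("cuDNN mismatch: built against",
              decode_cudnn_version(BUILD_VERSION),
              "but running against", decode_cudnn_version(found))
```

With the hypothetical values above, `minor_versions_match(7600, 7500)` is false, which corresponds to the 7.6 build running against a 7.5 runtime.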

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@zixuanweeei
Contributor

It seems CI was stuck. I have also resolved several conflicts in PR #15741. Would you mind rebasing on it and triggering CI again? Thanks.

BTW, I also found that test_operator_gpu.test_laop_6 and test_operator_gpu.test_convolution_multiple_streams have failed once or twice. I do not know what caused these failures.

@zixuanweeei
Contributor

zixuanweeei commented Aug 20, 2019

Is there any update on this? Or did the CI cuDNN versions turn out to be irrelevant?

@DickJC123
Contributor Author

Thanks for your patience on this. I began working on the issues in your PR with the commit 'bump cudnn version...' and found that CI passed on the platform that was causing you trouble. However, I could not get a fully passing CI run because of a number of issues, which I finally fixed in the merged PR #15922. I am out of the office this week. If you don't mind pursuing a remedy on your own, I suggest you re-merge your PR with master (to pick up #15922), then cherry-pick my 'bump cudnn version...' commit. Please let me know if that stabilizes your PR.

@zixuanweeei
Contributor

@DickJC123 Sure, I will give it a try. Thanks for your reply.

@anirudhacharya
Member

@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label Aug 26, 2019
@szha szha closed this Sep 7, 2020
5 participants