Fix build #1479

Merged: 8 commits, Nov 27, 2022

@yinghai yinghai commented Nov 24, 2022

How to debug

  1. Follow https://circleci.com/docs/ssh-access-jobs/ to SSH into the host where the test failed (e.g. https://app.circleci.com/pipelines/github/pytorch/TensorRT/1461/workflows/877d58af-c765-48df-b6c0-e6328723a7fd/jobs/6968).
  2. Rerun the failed command on the CircleCI host: python3 -c "import torch_tensorrt; torch_tensorrt.dump_build_info()" and confirm that the issue reproduces.
  3. Rerun it with LD_DEBUG=libs python3 -c "import torch_tensorrt; torch_tensorrt.dump_build_info()" and locate the torch libraries being loaded, e.g. /opt/circleci/.pyenv/versions/3.9.4/lib/python3.9/site-packages/torch/lib/libc10_cuda.so.
  4. Run nm -C /opt/circleci/.pyenv/versions/3.9.4/lib/python3.9/site-packages/torch/lib/libc10_cuda.so | grep CUDACachingAllocator | grep T and notice that the output does not contain the symbol c10::cuda::CUDACachingAllocator::allocator. This suggests that the libraries are not from the torch 1.14 nightly.
  5. Check the torch version with python3 -c "import torch; print(torch.__version__)". It shows 1.13.0+cu117, which is unexpected because we are supposed to install 1.14.0+cu116.
  6. Check the CI pipeline and notice that during the Install torch-tensorrt stage we are actually uninstalling the previously installed torch 1.14:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 0.14.0.dev20221114+cu116 requires torch==1.14.0.dev20221114, but you have torch 1.13.0 which is incompatible.

  7. Check py/setup.py and find that it pins the torch version to "torch>=1.13.0.dev0,<1.14.0", which excludes the 1.14 nightly. Fix that.
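
The symbol check, the version check, and the pin check from the steps above can be sketched in Python. This is only a minimal sketch, not the project's tooling: the helper names exports_symbol, release_tuple and allowed_by_old_pin are hypothetical, and the version comparison looks only at the numeric release segment, which is a simplification of the full PEP 440 rules that pip applies.

```python
import re

# Matches a defined-symbol line from `nm -C`: "<hex address> <type> <name>".
# Undefined symbols have no address column, so they do not match.
_DEFINED = re.compile(r"^[0-9a-fA-F]+\s+(\S)\s+(.*)$")


def exports_symbol(nm_output: str, symbol: str) -> bool:
    """Return True if the nm output defines a symbol whose name contains `symbol`."""
    for line in nm_output.splitlines():
        match = _DEFINED.match(line)
        if match and symbol in match.group(2):
            return True
    return False


def release_tuple(version: str) -> tuple:
    """Extract the numeric release segment, e.g. '1.13.0+cu117' -> (1, 13, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def allowed_by_old_pin(version: str) -> bool:
    """Approximate the old pin "torch>=1.13.0.dev0,<1.14.0" (assumption:
    comparing release segments only is close enough for this diagnosis)."""
    return (1, 13, 0) <= release_tuple(version) < (1, 14, 0)


if __name__ == "__main__":
    # The stable release satisfies the old pin, the nightly does not,
    # so pip replaces the nightly with 1.13.0.
    print(allowed_by_old_pin("1.13.0+cu117"))              # True
    print(allowed_by_old_pin("1.14.0.dev20221114+cu116"))  # False
```

To apply exports_symbol to a real library, pass it the captured output of nm -C on libc10_cuda.so together with the symbol name from step 4.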

Type of change

Please delete options that are not relevant and/or add your own.

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

@github-actions github-actions bot added the component: api [Python] Issues re: Python API label Nov 24, 2022
@github-actions github-actions bot added component: core Issues re: The core compiler component: runtime labels Nov 24, 2022
@github-actions github-actions bot added the component: build system Issues re: Build system label Nov 27, 2022
Author

yinghai commented Nov 27, 2022

@narendasan //tests/core/conversion/converters:test_einsum failed. Please help check. Thanks.

Contributor

@frank-wei frank-wei left a comment

Thanks @yinghai for taking the time to dive deep into this issue. It helps us a lot in unblocking the 1.3 release!

- ":use_pre_cxx11_abi": ["@libtorch_pre_cxx11_abi//:libtorch"],
- "//conditions:default": ["@libtorch//:libtorch"],
+ ":use_pre_cxx11_abi": ["@libtorch_pre_cxx11_abi//:libtorch", "@libtorch_pre_cxx11_abi//:c10_cuda"],
+ "//conditions:default": ["@libtorch//:libtorch", "@libtorch//:c10_cuda"],
Contributor

Did you forget to mention in the diagnosis that we also need to link with c10_cuda? :-)

Author

We probably don't need that. But it doesn't hurt.

Labels
  • cla signed
  • component: api [C++] (Issues re: C++ API)
  • component: api [Python] (Issues re: Python API)
  • component: build system (Issues re: Build system)
  • component: core (Issues re: The core compiler)
  • component: runtime